This kernel is for aspiring data scientists to learn from and to review their knowledge. We will carry out a detailed statistical analysis of the Titanic dataset along with a machine learning model implementation. I am super excited to share my first kernel with the Kaggle community. As I go on in this journey and learn new topics, I will incorporate them in each new update. So, check back for them, and please leave a comment if you have any suggestions to make this kernel better! Going back to the topics of this kernel, I will do more in-depth visualizations to explain the data, and machine learning classifiers will be used to predict passenger survival status.
NOTE:
This is a Julia translation.
If you are reading this on GitHub, I recommend you read it on Kaggle.
Follow me on github:
Kernel Goals
There are three primary goals of this kernel.
Do a statistical analysis of how some groups of people survived more than others.
Do an exploratory data analysis (EDA) of the Titanic data with visualizations and storytelling.
Predict: Use machine learning classification models to predict passengers' chances of survival.
P.S. If you want to learn more about regression models, try this kernel.
Part 1: Importing Necessary Libraries and datasets
1a. Loading libraries
Julia is a fantastic language with a vibrant community that produces many amazing libraries. I am not a big fan of importing everything at once for newcomers. So, I am going to introduce a few necessary libraries for now, and as we go on, we will keep unboxing new libraries when it seems appropriate.
Activating project at `C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook`
Resolving package versions...
No Changes to `C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\Project.toml`
No Changes to `C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\Manifest.toml`
import DataFrames as DF
import CSV
import CairoMakie as Makie
import AlgebraOfGraphics as AoG
import Statistics as Stats
import StatsBase
import Chain: @chain
import Random: shuffle
import IJulia
After loading the necessary modules, we need to import the datasets. Many business problems come with a tremendous amount of messy data that has to be extracted from many sources. I am hoping to write about that in a different kernel. For now, we are going to work with a less complicated and quite popular machine learning dataset.
## Importing the datasets
using CSV
train = CSV.read("./input/train.csv", DF.DataFrame)
test = CSV.read("./input/test.csv", DF.DataFrame);
You are probably wondering: why two datasets? And why have I named them “train” and “test”? To explain that, I am going to give you an overall picture of the supervised machine learning process.
“Machine Learning” is simply “Machine” and “Learning”. Nothing more and nothing less. In a supervised machine learning process, we give a machine/computer/model specific inputs or data (text/number/image/audio) to learn from; in other words, we train the machine to learn certain patterns from the data and the known output. Now, how can we determine that the machine is actually learning what we are trying to teach? That is where the test set comes into play. We withhold part of the data for which we know the output/result of each data point, and we use this data to test the trained models. We then compare the predicted outcomes with the known ones to determine the performance of the algorithms. If you are a bit confused, that’s okay. I will explain more as we keep reading. Let’s take a look at the sample datasets.
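To make the idea of withholding data concrete, here is a minimal sketch of a random hold-out split. Kaggle already ships this competition as separate train.csv and test.csv files, so nothing like this is needed for the kernel itself; the helper name `holdout_split`, the 20% fraction and the seed are my own assumptions for illustration.

```julia
import Random

# Hypothetical helper: split row indices into a training part and a held-out part.
function holdout_split(n::Int; test_fraction::Float64 = 0.2, seed::Int = 42)
    rng = Random.MersenneTwister(seed)       # fixed seed so the split is reproducible
    idx = Random.shuffle(rng, 1:n)           # random permutation of the row indices
    n_test = round(Int, test_fraction * n)   # number of rows to withhold
    return idx[n_test+1:end], idx[1:n_test]  # (train indices, held-out indices)
end

train_idx, test_idx = holdout_split(891)
println("train rows: ", length(train_idx), ", held-out rows: ", length(test_idx))
```

The model would only ever see the rows in `train_idx`; the rows in `test_idx` are kept aside to check how well it generalizes.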
This is a sample of the train and test datasets. Let’s find out a bit more about them.
println("The shape of the train data is (row, column): $(size(train))")
println("Train dataset info:")
DF.describe(train)
println("The shape of the test data is (row, column): $(size(test))")
println("Test dataset info:")
DF.describe(test)
The shape of the train data is (row, column): (891, 12)
Train dataset info:
The shape of the test data is (row, column): (418, 11)
Test dataset info:
11×7 DataFrame

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
| 1 | PassengerId | 1100.5 | 892 | 1100.5 | 1309 | 0 | Int64 |
| 2 | Pclass | 2.26555 | 1 | 3.0 | 3 | 0 | Int64 |
| 3 | Name | | Abbott, Master. Eugene Joseph | | van Billiard, Master. Walter John | 0 | String |
| 4 | Sex | | female | | male | 0 | String7 |
| 5 | Age | 30.2726 | 0.17 | 27.0 | 76.0 | 86 | Union{Missing, Float64} |
| 6 | SibSp | 0.447368 | 0 | 0.0 | 8 | 0 | Int64 |
| 7 | Parch | 0.392344 | 0 | 0.0 | 9 | 0 | Int64 |
| 8 | Ticket | | 110469 | | W.E.P. 5734 | 0 | String31 |
| 9 | Fare | 35.6272 | 0.0 | 14.4542 | 512.329 | 1 | Union{Missing, Float64} |
| 10 | Cabin | | A11 | | G6 | 327 | Union{Missing, String15} |
| 11 | Embarked | | C | | S | 0 | String1 |
1d. About This Dataset
The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set includes our target variable (dependent variable), passenger survival status (also known as the ground truth from the Titanic tragedy), along with other independent features like gender, fare, and Pclass.
The test set should be used to see how well our model performs on unseen data. When we say unseen data, we mean that the algorithm or machine learning models have no relation to the test data. We do not want to use any part of the test data in any way to modify our algorithms, which is why we clean our test data and train data separately. The test set does not provide passengers’ survival status. We are going to use our model to predict it.
Now let’s go through the features and describe them a little. There are a couple of different types of variables. They are…
Categorical:
- Nominal (variables that have two or more categories, but which do not have an intrinsic order)
  - Cabin
  - Embarked (port of embarkation): C (Cherbourg), Q (Queenstown), S (Southampton)
- Dichotomous (nominal variable with only two categories)
  - Sex: Female, Male
- Ordinal (variables that have two or more categories just like nominal variables, except that the categories can also be ordered or ranked)
  - Pclass (a proxy for socio-economic status (SES)): 1 (Upper), 2 (Middle), 3 (Lower)

Numeric:
- Discrete
  - Passenger ID (unique identifying # for each passenger)
  - SibSp
  - Parch
  - Survived (our outcome or dependent variable): 0, 1
- Continuous
  - Age
  - Fare

Text Variable
- Ticket (ticket number of the passenger)
- Name (name of the passenger)
1e. Tableau Visualization of the Data
I have incorporated a tableau visualization below of the training data. This visualization…
is for us to have an overview and play around with the dataset.
is done without making any changes(including Null values) to any features of the dataset.
Let’s get a better perspective of the dataset through this visualization.
We want to see how the left vertical bar changes when we filter on unique values of certain features. We can use multiple filters to see if there are any correlations among them. For example, if we click on the upper and Female tabs, we would see that green dominates the bar, with a ratio of 91:3 between female passengers who survived and those who did not; a 97% survival rate for upper class females. We can reset the filters by clicking anywhere in the white space. The age distribution chart on top provides some more information, such as the age range of those three unlucky females, since red marks the passengers who did not survive. If you would like to check out some of my other Tableau charts, please click here.
Part 2: Overview and Cleaning the Data
2a. Overview
Datasets in the real world are often messy. However, this dataset is almost clean. Let’s analyze it and see what we have here.
It looks like the columns have unequal numbers of data entries and many different types of variables. This can happen for the following reasons…
"""
This function takes a DataFrame as input and returns total missing values and percentages
"""
function missing_percentage(df::DF.DataFrame)
    missing_counts = [count(ismissing, df[!, col]) for col in DF.names(df)]
    missing_pct = round.(missing_counts ./ DF.nrow(df) .* 100, digits=2)
    # Create result DataFrame
    result = DF.DataFrame(
        Column = DF.names(df),
        Total = missing_counts,
        Percent = missing_pct
    )
    # Sort by total missing values (descending)
    return DF.sort(result, :Total, rev=true)
end
missing_percentage (generic function with 1 method)
"""
This function takes a dataframe and a column and finds the percentage of the value_counts
"""
function percent_value_counts(df::DF.DataFrame, feature::Symbol)
    # Count values including missing
    counts = DF.combine(DF.groupby(df, feature), DF.nrow => :Total)
    # Calculate percentages
    counts.Percent = round.(counts.Total ./ DF.nrow(df) .* 100, digits=2)
    # Sort by total count (descending)
    return DF.sort(counts, :Total, rev=true)
end
percent_value_counts (generic function with 1 method)
It looks like there are only two null values (~0.22%) in the Embarked feature; we could replace these with the mode value “S”. However, let’s dig a little deeper.
We may be able to resolve these two missing values by looking at the other independent variables of the two rows. Both passengers paid a fare of $80, are of Pclass 1 and are female. Let’s see how Fare is distributed across all Pclass and Embarked values.
fig = Makie.Figure()
# Prepare data for plotting
train_clean = DF.dropmissing(train, [:Embarked, :Fare, :Pclass])
test_clean = DF.dropmissing(test, [:Embarked, :Fare, :Pclass])
# Create mapping for embarked ports to numbers
unique_categories = unique(train_clean.Embarked)
category_to_index = Dict(category => i for (i, category) in enumerate(unique_categories))
# Convert categorical to numeric
train_clean.Embarked_num = [category_to_index[port] for port in train_clean.Embarked]
test_clean.Embarked_num = [category_to_index[port] for port in test_clean.Embarked]
# Training set boxplot
ax1 = Makie.Axis(fig[1, 1], title = "Training Set", xlabel = "Embarked", ylabel = "Fare", xticks = (1:3, unique_categories))
Makie.boxplot!(ax1, train_clean.Embarked_num, train_clean.Fare, dodge = train_clean.Pclass, color = train_clean.Pclass)
# Test set boxplot
ax2 = Makie.Axis(fig[1, 2], title = "Test Set", xlabel = "Embarked", ylabel = "Fare", xticks = (1:3, unique_categories))
Makie.boxplot!(ax2, test_clean.Embarked_num, test_clean.Fare, dodge = test_clean.Pclass, color = test_clean.Pclass)
fig
Here, in both the training set and the test set, the average fare closest to $80 belongs to passengers with Embarked value C and Pclass 1. So, let’s fill in the missing values as “C”.
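The two filling strategies discussed above (the mode versus a value chosen from context) can be sketched on a toy column. The vector below is made up for illustration and the helper `mode_value` is a hypothetical name of mine, not something defined elsewhere in this kernel.

```julia
# Toy Embarked column with two missing entries (not the real data)
embarked = Union{Missing, String}["S", "C", missing, "S", "Q", missing, "S"]

# Hypothetical helper: most frequent non-missing value (the mode), counted by hand
function mode_value(values)
    counts = Dict{String, Int}()
    for v in skipmissing(values)
        counts[v] = get(counts, v, 0) + 1
    end
    return findmax(counts)[2]   # findmax on a Dict returns (largest count, its key)
end

filled_with_mode = coalesce.(embarked, mode_value(embarked))  # option 1: use the mode
filled_with_c    = coalesce.(embarked, "C")                   # option 2: use "C", as decided above
println(filled_with_mode)
println(filled_with_c)
```

`coalesce.` leaves every existing value untouched and only replaces the missing entries, which is why it is a convenient way to apply either choice.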
Train Cabin missing: 0.7710437710437711
Test Cabin missing: 0.7822966507177034
Approximately 77% of the Cabin feature is missing in the training data and 78% in the test data. We have two choices:
we can either get rid of the whole feature, or
we can brainstorm a little and find an appropriate way to put it to use. For example, we may say that passengers with a cabin record had a higher socio-economic status than others. We may also say that passengers with a cabin record were more likely to be taken into consideration when loading the lifeboats.
Let’s combine the train and test data first and, for now, assign all the null values as “N”.
All the cabin names start with an English letter followed by digits. It seems that some passengers booked multiple cabin rooms in their name, probably because many of them travelled with family. However, they all seem to have booked under the same letter followed by different numbers. The letters seem to carry more significance than the numbers. Therefore, we can group these cabins according to the letter of the cabin name.
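Grouping by letter could look something like the sketch below. The function name `cabin_letter` and the sample entries are assumptions for illustration; the only things taken from the text are the idea of keeping the first letter and the “N” placeholder for null cabins.

```julia
# Hypothetical helper: reduce a raw cabin entry to its leading letter.
# Missing cabins get the placeholder "N", as described above.
cabin_letter(cabin::Missing) = "N"
cabin_letter(cabin::AbstractString) = string(first(cabin))

# Illustrative entries, including one passenger with several cabins booked
cabins = ["C85", missing, "B57 B59 B63 B66", "E46", missing]
letters = cabin_letter.(cabins)
println(letters)
```

Using two small methods instead of an `if ismissing(...)` branch lets Julia's dispatch pick the right behavior for each element.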
So, we still haven’t done any effective work to replace the null values. Let’s stop for a second here and think through how we can take advantage of some of the other features.
We can use the average of the Fare column. With the DataFrames groupby function we can get the mean fare of each cabin letter.
@chain all_data begin
    DF.dropmissing(:Fare)
    DF.groupby(:Cabin)
    DF.combine(:Fare => Stats.mean => :Mean_Fare)
    DF.sort(:Mean_Fare)
end
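9×2 DataFrame

| Row | Cabin | Mean_Fare |
|---|---|---|
| | String | Float64 |
| 1 | G | 14.205 |
| 2 | F | 18.0794 |
| 3 | N | 19.1327 |
| 4 | T | 35.5 |
| 5 | A | 41.2443 |
| 6 | D | 53.0073 |
| 7 | E | 54.5646 |
| 8 | C | 107.927 |
| 9 | B | 122.383 |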
Now, these means can help us determine the unknown cabins if we compare the fare of each unknown cabin row with the means given above. Let’s write a simple function so that we can assign cabin letters based on the means.
"""
Grouping cabin feature by the first letter based on fare
"""
function cabin_estimator(fare::Union{Float64, Missing})
    # Handle missing values
    if ismissing(fare)
        return "N"  # Default cabin for missing fare
    end
    if fare < 16
        return "G"
    elseif 16 ≤ fare < 27
        return "F"
    elseif 27 ≤ fare < 38
        return "T"
    elseif 38 ≤ fare < 47
        return "A"
    elseif 47 ≤ fare < 53
        return "E"
    elseif 53 ≤ fare < 54
        return "D"
    elseif 54 ≤ fare < 116
        return "C"
    else
        return "B"
    end
end
cabin_estimator (generic function with 1 method)
Let’s apply the cabin_estimator function to each unknown cabin (cabins with null values). Once that is done, we will separate our train and test sets to continue towards machine learning modeling.
with_N.Cabin = cabin_estimator.(with_N.Fare)
# Combine back together
all_data = vcat(with_N, without_N)
# Sort by PassengerId
DF.sort!(all_data, :PassengerId)
# Separate train and test
train = all_data[1:891, :]
test = all_data[892:end, :]
# Add back survival information
train.Survived = survivors;
Fare Feature
If you have paid attention so far, you know that there is only one missing value in the Fare column. Let’s have a look at it.
print("test")
test[ismissing.(test.Fare), :]
test
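1×11 DataFrame

| Row | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | String | String7 | Float64? | Int64 | Int64 | String31 | Float64? | String | Abstract… |
| 1 | 1044 | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | 3701 | missing | N | S |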
Here, we could take the average of the Fare column to fill in the missing value. However, for the sake of learning and practicing, we will try something else. We can take the average of the values where Pclass is 3, Sex is male and Embarked is S.
missing_value = @chain test begin
    DF.subset(:Pclass => x -> x .== 3, :Embarked => x -> x .== "S", :Sex => x -> x .== "male")
    _.Fare
    skipmissing
    Stats.mean
end
# Replace missing fare
test.Fare = coalesce.(test.Fare, missing_value);
Age Feature
Among the remaining features, “Age” is the one with the most missing values. Let’s see it in terms of percentage.
println("Train age missing value: $(round(count(ismissing, train.Age) / DF.nrow(train) * 100, digits=2))%")
println("Test age missing value: $(round(count(ismissing, test.Age) / DF.nrow(test) * 100, digits=2))%")
Train age missing value: 19.87%
Test age missing value: 20.57%
We will take a different approach, since ~20% of the data in the Age column is missing in both the train and test datasets. The Age variable seems promising for determining survival rate. Therefore, it would be unwise to replace the missing values with the median, mean or mode. We will use a machine learning model, a Random Forest Regressor, to impute the missing values instead. We will keep the Age column unchanged for now and work on it in the feature engineering section.
Part 3. Visualization and Feature Relations
Before we dive into finding relations between the independent variables and our dependent variable (Survived), let us make some assumptions about how the relations among features may turn out.
Assumptions:
Gender: More females survived than males.
Pclass: Passengers of higher socio-economic status survived more than others.
Age: Younger passengers survived more than other passengers.
Fare: Passengers with a higher fare survived more than other passengers. This can be quite correlated with Pclass.
Now, let’s see how the features are related to each other by creating some visualizations.
This bar plot shows the share of female and male passengers who survived. The x-axis represents the Sex feature, while the y-axis represents the % of passengers who survived. It shows that ~74% of female passengers survived, while only ~19% of male passengers did.
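For readers who want to see how such percentages are obtained, here is a small sketch on made-up data. The vectors and the helper `survival_rate` are assumptions for illustration only; the trick it relies on is that the mean of a 0/1 column is the share of ones.

```julia
import Statistics

# Toy data, not the real passenger list
sex      = ["female", "male", "male", "female", "male", "female", "male", "male"]
survived = [1, 0, 0, 1, 1, 0, 0, 0]

# Hypothetical helper: survival rate (%) per group
function survival_rate(groups, outcome)
    rates = Dict{String, Float64}()
    for g in unique(groups)
        # outcome[groups .== g] selects the 0/1 outcomes of one group
        rates[g] = round(100 * Statistics.mean(outcome[groups .== g]), digits = 2)
    end
    return rates
end

rates = survival_rate(sex, survived)
println(rates)
```

On the real data the same calculation would be applied to the Sex and Survived columns of the training set.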
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "Passenger Gender Distribution - Survived vs Not-survived",
    xlabel = "Sex",
    ylabel = "# of Passenger Survived",
    xticks = (1:2, ["Male", "Female"]))
# Count data for grouped bar chart
count_data = @chain train begin
    DF.groupby([:Sex, :Survived])
    DF.combine(DF.nrow => :count)
    DF.unstack(:Survived, :count, fill=0)
end
# Create grouped bar chart
counts = [count_data[1, 2], count_data[1, 3], count_data[2, 2], count_data[2, 3]]
Makie.barplot!(ax, [1, 1, 2, 2], counts, dodge = [1, 2, 1, 2], color = ["gray", "green", "gray", "green"])
# Add legend
Makie.Legend(fig[1, 2],
    [Makie.PolyElement(color = "gray"), Makie.PolyElement(color = "green")],
    ["Not Survived", "Survived"], "Survival Status")
fig
This count plot shows the actual numbers of male and female passengers who survived and did not survive. Among the females, ~230 survived and ~80 did not, while among the male passengers ~110 survived and ~470 did not.
Summary
As we suspected, female passengers survived at a much better rate than male passengers.
This seems about right, since women and children were the priority.
Makie.barplot([1, 2, 3], survived_percentage,
    axis = (xticks = (1:3, ["1st Class", "2nd Class", "3rd Class"]),
            title = "Passenger Class Distribution - Survived vs Non-Survived"),
)
It looks like …
~63% of first class passengers survived the Titanic tragedy, while
~48% of second class and
only ~24% of third class passengers survived.
fig = Makie.Figure()
# Title and axis labels belong to the Axis, not the Figure
ax = Makie.Axis(fig[1, 1],
    title = "Passenger Class Distribution - Survived vs Non-Survived",
    xlabel = "Passenger Class",
    ylabel = "Density of Passenger Survived",
    xticks = ([1, 2, 3], ["Upper", "Middle", "Lower"]))
not_survived = train.Pclass[train.Survived .== 0]
survived = train.Pclass[train.Survived .== 1]
d1 = Makie.density!(ax, not_survived, color = (:gray, 0.2), strokecolor = :gray, strokewidth = 2)
d2 = Makie.density!(ax, survived, color = (:green, 0.2), strokecolor = :green, strokewidth = 2)
Makie.axislegend(ax, [d1, d2], ["Not Survived", "Survived"], "Survival Status")
fig
This KDE plot is pretty self-explanatory with all the labels and colors. Something that some readers might find questionable is that more lower class passengers survived than second class passengers. This is true in absolute numbers, since there were a lot more third class passengers than first and second class passengers.
Summary
The first class passengers had the upper hand during the tragedy. You will probably agree with me more on this in the next section of visualizations, where we look at the distribution of ticket fare against the Survived column.
3c. Fare and Survived
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "Fare Distribution - Survived vs Non-Survived",
    xlabel = "Fare",
    ylabel = "Density of Passenger Survived")
not_survived = train.Fare[train.Survived .== 0]
survived = train.Fare[train.Survived .== 1]
d1 = Makie.density!(ax, not_survived, color = (:gray, 0.2), strokecolor = :gray, strokewidth = 2)
d2 = Makie.density!(ax, survived, color = (:green, 0.2), strokecolor = :green, strokewidth = 2)
Makie.axislegend(ax, [d1, d2], ["Not Survived", "Survived"], "Survival Status")
fig
This plot shows something impressive.
The spike in the plot under 100 dollars shows that a lot of passengers who bought a ticket within that range did not survive.
When the fare is more than approximately 280 dollars, there is no gray shade, which means either everyone above that fare point survived or there is an outlier that clouds our judgment. Let’s check…
train[train.Fare .>280, :]
3×12 DataFrame

| Row | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | String | String7 | Float64? | Int64 | Int64 | String31 | Float64? | String | Abstract… | Int64 |
| 1 | 259 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.329 | B | C | 1 |
| 2 | 680 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.329 | B | C | 1 |
| 3 | 738 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.329 | B | C | 1 |
As we assumed, these look like outliers with a fare of $512. We could certainly delete these points. However, we will keep them for now.
3d. Age and Survived
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "Age Distribution - Survived vs Non-Survived",
    xlabel = "Age",
    ylabel = "Density of Passenger Survived")
# clean missing first
clean_train = DF.dropmissing(train, :Age)
not_survived = clean_train.Age[clean_train.Survived .== 0]
survived = clean_train.Age[clean_train.Survived .== 1]
d1 = Makie.density!(ax, not_survived, color = (:gray, 0.2), strokecolor = :gray, strokewidth = 2)
d2 = Makie.density!(ax, survived, color = (:green, 0.2), strokecolor = :green, strokewidth = 2)
Makie.axislegend(ax, [d1, d2], ["Not Survived", "Survived"], "Survival Status")
fig
There is nothing out of the ordinary about this plot, except the very left part of the distribution. This may hint at the possibility that children and infants were the priority.
3e. Combined Feature Relations
In this section, we are going to discover more than two feature relations in a single graph. I will try my best to illustrate most of the feature relations. Let’s get to it.
fig8 = Makie.Figure()
# Create subplots for each combination
for (i, (sex, survived)) in enumerate(Iterators.product(["female", "male"], [0, 1]))
    ax = Makie.Axis(fig8[div(i - 1, 2) + 1, i % 2 + 1],
        title = "$sex $survived", xlabel = "Age", ylabel = "Count")
    subset_data = train[(train.Sex .== sex) .& (train.Survived .== survived) .& .!ismissing.(train.Age), :]
    if DF.nrow(subset_data) > 0
        Makie.hist!(ax, subset_data.Age, bins = 20,
            color = survived == 1 ? "green" : "gray",
            strokewidth = 1, strokecolor = :white)
    end
end
# A Figure has no title attribute; add the title as a Label above the grid
Makie.Label(fig8[0, :], "Survived by Sex and Age")
fig8
A facet grid is a great way to visualize multiple variables and their relationships at once. From the chart in section 3a we have an intuition that female passengers had higher priority than males during the tragedy. However, from this facet grid, we can also understand which age groups survived more than others or were not so lucky.
fig8 = Makie.Figure()
# Create subplots for each combination
for (i, (sex, embarked)) in enumerate(Iterators.product(["female", "male"], ["S", "C", "Q"]))
    ax = Makie.Axis(fig8[div(i - 1, 2) + 1, i % 2 + 1], title = "$sex $embarked")
    subset_data = train[(train.Sex .== sex) .& (train.Embarked .== embarked) .& .!ismissing.(train.Age), :]
    for survived in [0, 1]
        subset_survived = subset_data[subset_data.Survived .== survived, :]
        println("Length of subset: $(DF.nrow(subset_survived))")
        # Only draw a histogram when this survival group is not empty
        if DF.nrow(subset_survived) > 0
            Makie.hist!(ax, subset_survived.Age, bins = 20,
                color = survived == 1 ? (:green, 0.5) : (:gray, 0.5),
                strokewidth = 1, strokecolor = :white,
                label = survived == 1 ? "Survived" : "Not Survived")
        end
    end
end
Makie.Legend(fig8[1, 3],
    [Makie.PolyElement(color = (:gray, 0.7)), Makie.PolyElement(color = (:green, 0.7))],
    ["Not Survived", "Survived"], "Survival Status")
# A Figure has no title attribute; add the title as a Label above the grid
Makie.Label(fig8[0, :], "Survived by Sex and Age")
fig8
Length of subset: 53
Length of subset: 133
Length of subset: 300
Length of subset: 68
Length of subset: 6
Length of subset: 57
Length of subset: 45
Length of subset: 24
Length of subset: 5
Length of subset: 7
Length of subset: 15
Length of subset: 1
This is another compelling facet grid illustrating the relationship between four features at once. They are Embarked, Age, Survived & Sex.
The color illustrates passengers’ survival status (green represents survived, gray represents not survived).
The column represents Sex (left: male, right: female).
The row represents Embarked (from top to bottom: S, C, Q).
Now that I have pointed out the apparent, let’s see if we can get some insights that are not so obvious as we look at the data.
Most passengers seem to have boarded at Southampton (S).
More than 60% of the passengers who boarded at Southampton died.
More than 60% of the passengers who boarded at Cherbourg (C) lived.
Pretty much every male who boarded at Queenstown (Q) did not survive.
Very few females boarded at Queenstown; however, most of them survived.
# `size` replaces the deprecated `resolution` keyword
fig9 = Makie.Figure(size = (1000, 600))
# Male subplot
ax9_m = Makie.Axis(fig9[1, 1], title = "Male", xlabel = "Fare", ylabel = "Age")
# Female subplot
ax9_f = Makie.Axis(fig9[1, 2], title = "Female", xlabel = "Fare", ylabel = "Age")
female_data = train[(train.Sex .== "female") .& .!ismissing.(train.Age), :]
male_data = train[(train.Sex .== "male") .& .!ismissing.(train.Age), :]
Makie.scatter!(ax9_m, male_data.Fare, male_data.Age,
    color = [s == 1 ? "green" : "gray" for s in male_data.Survived],
    strokewidth = 1, strokecolor = "white", markersize = 14)
Makie.scatter!(ax9_f, female_data.Fare, female_data.Age,
    color = [s == 1 ? "green" : "gray" for s in female_data.Survived],
    strokewidth = 1, strokecolor = "white", markersize = 14)
# Add legend
Makie.Legend(fig9[1, 3],
    [Makie.MarkerElement(color = "gray", marker = :circle), Makie.MarkerElement(color = "green", marker = :circle)],
    ["Not Survived", "Survived"], "Survived")
Makie.Label(fig9[0, :], "Survived by Sex, Fare and Age")
fig9
This facet grid unveils a couple of interesting insights. Let’s find out.
The grid above clearly shows the three outliers with a Fare of over $500. At this point, I think we are quite confident that these outliers should be deleted.
Most of the passengers paid a Fare of less than $100.
## get the most important variables.
# Julia translation of the pandas original. Assumptions: Sex is encoded as male = 1
# and rows with any missing value are dropped before computing the correlations.
numeric = DF.dropmissing(DF.select(train, :Survived, :Pclass,
    :Sex => DF.ByRow(s -> Int(s == "male")) => :Sex, :Age, :SibSp, :Parch, :Fare))
corr = DF.DataFrame(feature = DF.names(numeric),
    r2 = vec(Stats.cor(Matrix(numeric), numeric.Survived)) .^ 2)
DF.sort(corr, :r2, rev=true)
Squaring the correlations not only makes every value positive, so features can be ranked by strength regardless of sign, but also accentuates the gap between strong and weak relationships.
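A tiny sketch of what squaring does to a correlation, using made-up columns rather than the Titanic data. The variable names are mine; `Statistics.cor` computes the Pearson correlation.

```julia
import Statistics

# Toy columns (not the Titanic data)
y = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
a = [1.0, 2.0, 8.0, 9.0, 7.0, 3.0]   # tends to be high when y is 1
b = [3.0, 3.0, 1.0, 1.0, 2.0, 3.0]   # tends to be high when y is 0

r_a = Statistics.cor(a, y)   # positive correlation
r_b = Statistics.cor(b, y)   # negative correlation

println("r:   a = ", round(r_a, digits = 3), ", b = ", round(r_b, digits = 3))
println("r^2: a = ", round(r_a^2, digits = 3), ", b = ", round(r_b^2, digits = 3))
```

After squaring, both values are positive and can be sorted on one scale. Since |r| is at most 1, squaring never increases the magnitude; it shrinks weak correlations proportionally more than strong ones.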
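# Correlation heatmap (Makie translation of the seaborn original).
# Assumptions: Sex is encoded as male = 1 and rows with missing values are dropped.
numeric = DF.dropmissing(DF.select(train, :Survived, :Pclass,
    :Sex => DF.ByRow(s -> Int(s == "male")) => :Sex, :Age, :SibSp, :Parch, :Fare))
labels = DF.names(numeric)
n = length(labels)
C = Stats.cor(Matrix(numeric))
# Hide the diagonal and the redundant half of the symmetric matrix
for j in 1:n, i in j:n
    C[i, j] = NaN
end
fig = Makie.Figure(size = (900, 720))
ax = Makie.Axis(fig[1, 1], title = "Correlations Among Features",
    xticks = (1:n, labels), yticks = (1:n, labels),
    xticklabelrotation = pi / 4, yreversed = true)
## in order to reverse the bar replace :RdBu with Makie.Reverse(:RdBu)
hm = Makie.heatmap!(ax, C, colormap = :RdBu, colorrange = (-1, 1))
Makie.Colorbar(fig[1, 2], hm)
fig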
So, let’s analyze these correlations a bit. We have found some moderately strong relationships between different features. There is a definite positive correlation between Fare and Survived. This relationship reveals that passengers who paid more money for their ticket were more likely to survive. This theory aligns with one other correlation, the one between Fare and Pclass (-0.6). It can be explained by saying that first class passengers (1) paid more for their fare than second class passengers (2), and similarly second class passengers paid more than third class passengers (3). This theory is also supported by another Pclass correlation, the one with our dependent variable, Survived. The correlation between Pclass and Survived is -0.33. This can be explained by saying that first class passengers had a better chance of surviving than second class passengers, who in turn had a better chance than third class passengers.
However, the strongest correlation with our dependent variable belongs to the Sex variable, which records whether the passenger was male or female. It is a negative correlation of -0.54, which points towards some undeniable insights. Let’s do some statistics to see how statistically significant this correlation is.
4b. Statistical Test for Correlation
Statistical tests are a scientific way to validate theories. When we look at the data, we seem to have an intuitive understanding of where the data is leading us. However, when we do statistical tests, we get a scientific or mathematical perspective on how significant these results are. Let’s apply some of these methods and see how we are doing with our predictions.
Hypothesis Testing Outline
A hypothesis test compares the mean of a control group and experimental group and tries to find out whether the two sample means are different from each other and if they are different, how significant that difference is.
A hypothesis test usually consists of multiple parts:
Formulate a well-developed research problem or question: The hypothesis test usually starts with a concrete and well-developed research problem. We need to ask the right question, one that can be answered using statistical analysis.
The null hypothesis (\(H_0\)) and alternative hypothesis (\(H_A\)):
- The null hypothesis (\(H_0\)) is something that is assumed to be true. It is the status quo. Under the null hypothesis, the observations are the result of pure chance. When we set out to experiment, we form the null hypothesis by saying that there is no difference between the means of the control group and the experimental group.
- The alternative hypothesis (\(H_A\)) is a claim and the opposite of the null hypothesis. It goes against the status quo. Under the alternative hypothesis, the observations show a real effect combined with a component of chance variation.
Determine the test statistic: The test statistic is used to assess the evidence against the null hypothesis. Depending on whether the population standard deviation is known, we use either z-statistics or t-statistics. In addition to that, we want to identify whether the test is a one-tailed test or a two-tailed test. This article explains it pretty well. This article is pretty good as well.
Specify a significance level and confidence level: The significance level (\(\alpha\)) is the probability of rejecting the null hypothesis when it is true. In other words, it is the rate of false rejections (type 1 errors) that we are willing to tolerate. The significance level is one minus the confidence level. For example, if our significance level is 5%, then our confidence level is (1 - 0.05) = 0.95, or 95%.
Compute the T-statistics/Z-statistics: Computing the t-statistic follows a simple equation. This equation differs slightly depending on whether it is a one-sample test or a two-sample test.
Compute the P-value: The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis is correct. The p-value is known to be unintuitive, and even many professors are known to explain it wrong. I think this video explains the p-value well. The smaller the p-value, the stronger the evidence against the null hypothesis.
Describe the result and compare the p-value with the significance level (\(\alpha\)): If p <= \(\alpha\), the observed effect is statistically significant, and we reject the null hypothesis in favor of the alternative hypothesis. However, if p > \(\alpha\), we say that we fail to reject the null hypothesis. Even though this phrasing sounds awkward, it is logically right: we never accept the null hypothesis, because a test on sample data can only fail to find enough evidence against it.
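The step on computing the t-statistic mentions an equation without writing it out. For reference, the standard textbook forms are given below; they are general formulas, not something derived in this kernel. Here \(\bar{x}\) is a sample mean, \(s\) a sample standard deviation, \(n\) a sample size, and \(\mu_0\) the mean assumed under \(H_0\). The two-sample form shown is Welch's version, which does not assume equal variances in the two groups.

\[
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \qquad \text{(one-sample test)}
\]

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \qquad \text{(two-sample test)}
\]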
We will follow each of the steps above to do our hypothesis testing below.
P.S. Khan Academy has a set of videos that I think are intuitive and that helped me understand the concepts.
Hypothesis testing for Titanic
Formulating a well developed researched question:
Regarding this dataset, we can formulate the null hypothesis and alternative hypothesis by asking the following questions.
Is there a significant difference in the mean sex between the passengers who survived and the passengers who did not survive?
Is there a substantial difference in the survival rate between the male and female passengers?
The Null Hypothesis and The Alternative Hypothesis:
We can formulate our hypothesis by asking questions differently. However, it is essential to understand what our end goal is. Here our dependent variable or target variable is Survived. Therefore, we say
Null Hypothesis (\(H_0\)): There is no difference in the survival rate between the male and female passengers, or the mean difference between male and female passengers in the survival rate is zero.
Alternative Hypothesis (\(H_A\)): There is a difference in the survival rate between the male and female passengers, or the mean difference in the survival rate between male and female passengers is not zero.
One thing we can do is set up the null and alternative hypotheses in such a way that, when we do our t-test, we can choose to do a one-tailed test. According to this article, one-tailed tests are more powerful than two-tailed tests when the effect lies in the hypothesized direction. In addition to that, this video is also quite helpful for understanding these topics. With this in mind, we can modify our null and alternative hypotheses. Let’s see how we can rewrite them.
Null Hypothesis(H0): male mean is greater or equal to female mean.
Alternative Hypothesis(H1): male mean is less than female mean.
Determine the test statistics:
The difference in survival rate between male and female passengers could be higher or lower than 0, which on its own would call for a two-tailed test; the directional hypotheses above correspond to a one-tailed test. Since we do not know the population standard deviation (\(\sigma\)) and n is small, we will use the t-distribution.
Specify the significance level:
Specifying a significance level is an important step of the hypothesis test. It is ultimately a balance between type 1 error and type 2 error. We will discuss those in more depth in another lesson. For now, we have decided to set our significance level (\(\alpha\)) = 0.05. So, our confidence level, or non-rejection region, would be (1 - \(\alpha\)) = (1 - 0.05) = 95%.
Computing T-statistics and P-value:
Let’s take a random sample and see the difference.
Now, we have to understand that those two means are not the population mean (\(\mu\)). The population mean is the term statisticians use for the actual average of an entire group. The group can be any collection of things we measure, such as animals, humans, plants, money, or stocks. For example, to find the population mean age of Bulgaria, we would have to record every single person’s age and take the average, which is almost impossible; and if we could go that route, there would be no point in doing statistics in the first place. Therefore, we approach this problem using samples. The idea is that if we take multiple samples from the same population, compute the mean of each one, and plot those means as a distribution, the distribution starts to look like a normal distribution centered on the population mean. The more samples we take, the more sample means we add and the clearer this picture becomes. This is where the Central Limit Theorem comes from. We will go more in depth on this topic later on.
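To see the sampling idea in action, here is a minimal sketch that uses only Python's standard library. The population is made up (10,000 yes/no outcomes with a true rate near 0.38), and the helper name `sample_means` is my own, so treat this as an illustration rather than part of the Titanic analysis.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 binary outcomes with a true rate near 0.38.
population = [1 if random.random() < 0.38 else 0 for _ in range(10_000)]
population_mean = statistics.mean(population)


def sample_means(pop, sample_size, n_samples):
    """Draw n_samples random samples and return the mean of each one."""
    return [statistics.mean(random.sample(pop, sample_size)) for _ in range(n_samples)]


means = sample_means(population, sample_size=50, n_samples=1_000)

print(f"population mean:        {population_mean:.3f}")
print(f"mean of sample means:   {statistics.mean(means):.3f}")
print(f"spread of sample means: {statistics.stdev(means):.3f}")
```

The mean of the sample means should land very close to the population mean, even though each individual sample of 50 can be noticeably off.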
Going back to our dataset, the means above are only part of the whole story. We were given part of the data to train our machine learning models, and the other part of the data was held back for testing. Therefore, it is impossible for us at this point to know the population means of survival for males and females. A situation like this calls for a statistical approach. We will use the sampling distribution approach to do the test. Let’s take 50 random samples of males and females from our train data.
# separating male and female dataframe.
import random
import numpy as np

male = train[train['Sex'] == 1]
female = train[train['Sex'] == 0]

## empty lists for storing the sample means
m_mean_samples = []
f_mean_samples = []

for i in range(50):
    m_mean_samples.append(np.mean(random.sample(list(male['Survived']), 50)))
    f_mean_samples.append(np.mean(random.sample(list(female['Survived']), 50)))

# Print them out
print(f"Male mean sample mean: {round(np.mean(m_mean_samples), 2)}")
print(f"Female mean sample mean: {round(np.mean(f_mean_samples), 2)}")
print(f"Difference between male and female mean sample mean: {round(np.mean(f_mean_samples) - np.mean(m_mean_samples), 2)}")
H0: male mean is greater or equal to female mean H1: male mean is less than female mean.
According to the samples, the measured difference between our male sample mean (\(\bar{x}_m\)) and female sample mean (\(\bar{x}_f\)) is ~0.55 (statistically, this is called the point estimate of the difference between the male and female population means). Keeping in mind that…
We randomly select 50 people to be in the male group and 50 people to be in the female group.
We know our sample is selected from a broader population (the training set).
We know we could have totally ended up with a different random sample of males and females.
With all three points above in mind, how confident are we that the measured difference is real or statistically significant? We can perform a t-test to evaluate that. When we perform a t-test, we are usually looking for evidence of a significant difference between a population mean and a hypothesized mean (one-sample t-test) or, in our case, a difference between two population means (two-sample t-test).
The t-statistic measures the degree to which our groups differ, standardized by the variance of our measurements. In other words, it is basically a measure of signal over noise. Let us describe the previous sentence a bit more for clarification. I am going to use this post as a reference to describe the t-statistic here.
Calculating the t-statistics
\[t = \frac{\bar{x}-\mu}{\frac{S} {\sqrt{n}} }\]
Here..
\(\bar{x}\) is the sample mean.
\(\mu\) is the hypothesized mean.
S is the standard deviation.
n is the sample size.
Now, the numerator of this fraction, \((\bar{x}-\mu)\), is basically the strength of the signal: the difference between the sample mean and the hypothesized mean. If the mean difference is larger, then the signal is stronger.
The denominator of this fraction, \({S}/{\sqrt{n}}\), measures the amount of variation or noise in the data set. Here S is the standard deviation, which tells us how much variation there is in the data, and n is the sample size.
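Here is a small sketch of the one-sample formula as signal over noise, written with the standard library only. The function name `one_sample_t` and the age values are made up for illustration.

```python
import math
import statistics


def one_sample_t(sample, hypothesized_mean):
    """t = (x_bar - mu) / (S / sqrt(n)), with S the sample standard deviation."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)        # sample standard deviation (n - 1)
    signal = x_bar - hypothesized_mean  # how far the sample mean is from mu
    noise = s / math.sqrt(n)            # standard error of the mean
    return signal / noise


# Made-up ages, tested against a hypothesized mean of 30.
ages = [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0]
t_value = one_sample_t(ages, hypothesized_mean=30.0)
print(f"t = {t_value:.4f}")
```

A t-value close to zero, as in this toy example, means the signal is tiny compared to the noise.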
So, according to the explanation above, the t-value or t-statistic basically measures the strength of the signal (the difference) relative to the amount of noise (the variation) in the data, and that is how we calculate the t-value in a one-sample t-test. However, in order to compare two sample means, as in our case, we will use the following equation.
\[t = \frac{\bar{x}_M - \bar{x}_F}{\sqrt{S^2\left(\frac{1}{n_M}+\frac{1}{n_F}\right)}}\]
This equation may seem too complex; however, the idea behind the two is similar. Both of them have the concept of signal/noise. The only difference is that we replace our hypothesized mean with another sample mean, and the two sample sizes replace the one sample size.
Here..
\(\bar{x}_M\) is the mean of our male group sample measurements.
\(\bar{x}_F\) is the mean of our female group sample measurements.
\(n_M\) and \(n_F\) are the number of observations in each group.
\(S^2\) is the sample variance.
It is good to have an understanding of what is going on in the background. However, we will use scipy.stats to find the t-statistic.
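As a rough sketch of that step, the snippet below runs `scipy.stats.ttest_ind` on two made-up arrays and checks the result against a hand-computed pooled-variance statistic (a single shared \(S^2\)). The arrays `male_means` and `female_means` are hypothetical stand-ins generated with NumPy, chosen so that their difference is near 0.55; they are not the notebook's actual samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 50 male and 50 female sample means.
male_means = rng.normal(loc=0.19, scale=0.05, size=50)
female_means = rng.normal(loc=0.74, scale=0.05, size=50)

# Two-sample t-test assuming equal variances (two-sided by default).
t_stat, p_value = stats.ttest_ind(male_means, female_means, equal_var=True)

# The same statistic by hand, using the pooled sample variance.
n_m, n_f = len(male_means), len(female_means)
pooled_var = (
    (n_m - 1) * male_means.var(ddof=1) + (n_f - 1) * female_means.var(ddof=1)
) / (n_m + n_f - 2)
t_manual = (male_means.mean() - female_means.mean()) / np.sqrt(
    pooled_var * (1 / n_m + 1 / n_f)
)

alpha = 0.05
print(f"t (scipy):  {t_stat:.3f}")
print(f"t (manual): {t_manual:.3f}")
print(f"p-value:    {p_value:.3g}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

By default `ttest_ind` reports a two-sided p-value. Recent SciPy versions also accept an `alternative` argument (for example `alternative="less"`) for a one-tailed test.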
Compare P-value with \(\alpha\)
It looks like the p-value is very small compared to our significance level (\(\alpha\)) of 0.05. Our observed difference is statistically significant. Therefore, we reject the null hypothesis in favor of the alternative hypothesis: “There is a significant difference in the survival rate between the male and female passengers.”
Part 5: Feature Engineering
Feature engineering is exactly what it sounds like. Sometimes we want to create extra features from within the features that we have; sometimes we want to remove features that are alike. Feature engineering is the simple term for doing all of that. It is important to remember to create new features in ways that do not cause multicollinearity (a strong relationship among independent variables).
name_length
Creating a new feature “name_length” that will take the count of letters of each name
# Creating a new column with the length of each name
train['name_length'] = [len(i) for i in train.Name]
test['name_length'] = [len(i) for i in test.Name]

def name_length_group(size):
    a = ''
    if (size <= 20):
        a = 'short'
    elif (size <= 35):
        a = 'medium'
    elif (size <= 45):
        a = 'good'
    else:
        a = 'long'
    return a

train['nLength_group'] = train['name_length'].map(name_length_group)
test['nLength_group'] = test['name_length'].map(name_length_group)

## Note: ".map" here is the pandas Series method, not Python's built-in map().
## Series.map applies the function to every value and returns a new Series,
## so name_length_group receives each name_length value as its "size" argument.
## The built-in form would be map(name_length_group, train['name_length']).

## Alternative: cut the column into bins based on the range of name_length
#group_names = ['short', 'medium', 'good', 'long']
#train['name_len_group'] = pd.cut(train['name_length'], bins = 4, labels=group_names)
## get the title from the name
train["title"] = [i.split('.')[0] for i in train.Name]
train["title"] = [i.split(',')[1] for i in train.title]
## Whenever we split like that, there is a good chance that we will end up with white space around our string values. Let's check that.
print(train.title.unique())
## We can also combine all three lines above for the test set here
test['title'] = [i.split('.')[0].split(',')[1].strip() for i in test.Name]
## However it is important to be able to write readable code, and the line above is not so readable.
## Let's replace some of the rare values with the keyword 'rare' and other word choice of our own.
## train Data
train["title"] = [i.replace('Ms', 'Miss') for i in train.title]
train["title"] = [i.replace('Mlle', 'Miss') for i in train.title]
train["title"] = [i.replace('Mme', 'Mrs') for i in train.title]
train["title"] = [i.replace('Dr', 'rare') for i in train.title]
train["title"] = [i.replace('Col', 'rare') for i in train.title]
train["title"] = [i.replace('Major', 'rare') for i in train.title]
train["title"] = [i.replace('Don', 'rare') for i in train.title]
train["title"] = [i.replace('Jonkheer', 'rare') for i in train.title]
train["title"] = [i.replace('Sir', 'rare') for i in train.title]
train["title"] = [i.replace('Lady', 'rare') for i in train.title]
train["title"] = [i.replace('Capt', 'rare') for i in train.title]
train["title"] = [i.replace('the Countess', 'rare') for i in train.title]
train["title"] = [i.replace('Rev', 'rare') for i in train.title]

## Now in programming there is a term called DRY (Don't repeat yourself). Whenever we are repeating
## the same code over and over again, a light bulb should turn on in our head and make us think
## about coding in a way that is not repetitive or dull. Let's write a function to do exactly what we
## did in the code above, only without the repetition.
## we are writing a function that can help us modify the title column
def name_converted(feature):
    """
    This function helps modifying the title column
    """
    result = ''
    if feature in ['the Countess', 'Capt', 'Lady', 'Sir', 'Jonkheer', 'Don', 'Major', 'Col', 'Rev', 'Dona', 'Dr']:
        result = 'rare'
    elif feature in ['Ms', 'Mlle']:
        result = 'Miss'
    elif feature == 'Mme':
        result = 'Mrs'
    else:
        result = feature
    return result

test.title = test.title.map(name_converted)
train.title = train.title.map(name_converted)
## bin the family size.
def family_group(size):
    """
    This function groups (loner, small, large) families based on family size
    """
    a = ''
    if (size <= 1):
        a = 'loner'
    elif (size <= 4):
        a = 'small'
    else:
        a = 'large'
    return a
## apply the family_group function in family_size
train['family_group'] = train['family_size'].map(family_group)
test['family_group'] = test['family_size'].map(family_group)
I have yet to figure out how best to handle the ticket feature, so any suggestion would be truly appreciated. For now, I will get rid of the ticket feature.
## Calculating fare based on family size.
train['calculated_fare'] = train.Fare/train.family_size
test['calculated_fare'] = test.Fare/test.family_size
Some people travelled in groups, such as family or friends. It seems like the Fare column records the total fare for the group rather than the fare of an individual passenger; therefore, calculated_fare will be much handier in this situation.
def fare_group(fare):
    """
    This function creates a fare group based on the fare provided
    """
    a = ''
    if fare <= 4:
        a = 'Very_low'
    elif fare <= 10:
        a = 'low'
    elif fare <= 20:
        a = 'mid'
    elif fare <= 45:
        a = 'high'
    else:
        a = "very_high"
    return a

train['fare_group'] = train['calculated_fare'].map(fare_group)
test['fare_group'] = test['calculated_fare'].map(fare_group)

#train['fare_group'] = pd.cut(train['calculated_fare'], bins = 4, labels=groups)
Fare group was calculated based on calculated_fare. This can further help our cause.
PassengerId
It seems like PassengerId column only works as an id in this dataset without any significant effect on the dataset. Let’s drop it.
Dummy variables are an important preprocessing step in machine learning. Categorical variables are often important features, which can be the difference between a good model and a great model. While working with a dataset, having meaningful values such as “male” or “female” instead of 0s and 1s is more intuitive for us. However, machines do not understand categorical values; for example, in this dataset we have gender as male or female, and many algorithms do not accept categorical variables as input. In order to feed data in a machine learning model, we
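As an illustration of the idea, here is a minimal sketch using `pd.get_dummies` on a tiny made-up dataframe. The dataframe `toy` and its values are hypothetical; only the column names Sex and fare_group are borrowed from this kernel.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns.
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "fare_group": ["low", "mid", "high", "low"],
    }
)

# One 0/1 column per category; drop_first=True drops one level per feature
# so that the remaining dummy columns are not perfectly collinear.
dummies = pd.get_dummies(toy, columns=["Sex", "fare_group"], drop_first=True)
print(dummies)
```

With `drop_first=True`, the dropped level (here "female" and "high") becomes the baseline: a row with all zeros for that feature belongs to the baseline category.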
## rearranging the columns so that I can easily use the dataframe to predict the missing age values.
train = pd.concat([train[["Survived", "Age", "Sex", "SibSp", "Parch"]], train.loc[:, "is_alone":]], axis=1)
test = pd.concat([test[["Age", "Sex"]], test.loc[:, "SibSp":]], axis=1)
## Importing RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

## writing a function that takes a dataframe with missing values and outputs it by filling the missing values.
def completing_age(df):
    ## getting all the features except survived
    age_df = df.loc[:, "Age":]

    temp_train = age_df.loc[age_df.Age.notnull()]  ## df with age values
    temp_test = age_df.loc[age_df.Age.isnull()]  ## df without age values

    y = temp_train.Age.values  ## setting target variable (age) in y
    x = temp_train.loc[:, "Sex":].values

    rfr = RandomForestRegressor(n_estimators=1500, n_jobs=-1)
    rfr.fit(x, y)

    predicted_age = rfr.predict(temp_test.loc[:, "Sex":])

    df.loc[df.Age.isnull(), "Age"] = predicted_age
    return df

## Implementing the completing_age function in both train and test dataset.
completing_age(train)
completing_age(test);
Let’s take a look at the histogram of the age column.
## Let's look at the histogram
plt.subplots(figsize = (22,10),)
sns.distplot(train.Age, bins = 100, kde = True, rug = False, norm_hist = False);
age_group
We can create a new feature by grouping the “Age” column
## create bins for age
def age_group_fun(age):
    """
    This function creates a bin for age
    """
    a = ''
    if age <= 1:
        a = 'infant'
    elif age <= 4:
        a = 'toddler'
    elif age <= 13:
        a = 'child'
    elif age <= 18:
        a = 'teenager'
    elif age <= 35:
        a = 'Young_Adult'
    elif age <= 45:
        a = 'adult'
    elif age <= 55:
        a = 'middle_aged'
    elif age <= 65:
        a = 'senior_citizen'
    else:
        a = 'old'
    return a

## Applying "age_group_fun" function to the "Age" column.
train['age_group'] = train['Age'].map(age_group_fun)
test['age_group'] = test['Age'].map(age_group_fun)

## Creating dummies for "age_group" feature.
train = pd.get_dummies(train, columns=['age_group'], drop_first=True)
test = pd.get_dummies(test, columns=['age_group'], drop_first=True);
Need to paraphrase this section
Feature Selection
Feature selection is an important part of machine learning models. There are many reasons why we use feature selection.
Simple models are easier to interpret. People who act on model results have a better understanding of the model.
Shorter training times.
Enhanced generalisation by reducing overfitting.
Easier for software developers to implement in production. As data scientists, we need to remember not to create models with too many variables, since it might overwhelm production engineers.
Reduced risk of data errors during model use.
Reduced data redundancy.
Part 6: Pre-Modeling Tasks
6a. Separating dependent and independent variables
Before we apply any machine learning models, it is important to separate the dependent and independent variables. Our dependent variable, or target variable, is what we are trying to predict, and our independent variables are the features we use to predict it. We train a machine learning model by specifying the independent variables and the dependent variable. To specify them, we need to separate them from each other, and the code below does just that.
P.S. In our test dataset, we do not have a dependent variable feature. We are to predict that using machine learning models.
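A minimal sketch of that separation, assuming a pandas dataframe with a Survived column. The dataframe `toy_train` is made up for illustration; in the kernel the same two lines would be applied to the real train data.

```python
import pandas as pd

# Hypothetical miniature version of the train data.
toy_train = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0],
        "Sex": [1, 0, 0, 1],
        "Age": [22.0, 38.0, 26.0, 35.0],
    }
)

# Independent variables (features): every column except the target.
X = toy_train.drop(columns=["Survived"])

# Dependent variable (target): the column we want to predict.
y = toy_train["Survived"]

print(X.shape, y.shape)
```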
There are multiple ways of splitting data. They are…
train_test_split.
cross_validation.
We have separated the dependent and independent features, and we have separate train and test data. So, why do we still have to split our training data? If you are curious about that, I have the answer. For this competition, when we train the machine learning algorithms, we use part of the training set, usually two-thirds of the train data. Once we have trained our algorithm on 2/3 of the train data, we test it on the remaining data. If the model performs well, we feed the competition test data to the model to predict, and submit to the competition. The code below splits the train data into 4 parts: X_train, X_test, y_train, y_test.
X_train and y_train are first used to train the algorithm.
Then, X_test is used in the trained algorithm to predict outcomes.
Once we get the outcomes, we compare them with y_test.
By comparing the outcome of the model with y_test, we can determine whether our algorithm is performing well or not. As we compare, we use a confusion matrix to determine different aspects of model performance.
P.S. When we use cross validation, it is important to remember not to use X_train, X_test, y_train and y_test; rather, we will use X and y. I will discuss more on that.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .33, random_state = 0)
len(X_train)
len(X_test)
6c. Feature Scaling
Feature scaling is an important concept in machine learning. Often a dataset contains features that vary highly in magnitude and unit. For some machine learning models, this is not a problem. However, for many others it is quite a problem: many machine learning algorithms use Euclidean distance to calculate the distance between two points, so features with large magnitudes dominate. Let’s again look at a sample of the train dataset below.
Here Age and Calculated_fare are much higher in magnitude compared to the other features. This can create problems, as many machine learning models will get confused into thinking Age and Calculated_fare carry more weight than other features. Therefore, we need to do feature scaling to get a better result. There are multiple ways to do feature scaling.
MinMaxScaler: scales the data using the max and min values so that it fits between 0 and 1.
StandardScaler: scales the data so that it has a mean of 0 and a variance of 1.
RobustScaler: scales the data similarly to StandardScaler, but makes use of the median and scales using the interquartile range so as to avoid issues with large outliers.
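To make the three definitions above concrete, here is a hand-rolled sketch using only Python's standard library. It is not the scikit-learn implementation; the function names and the fare values are made up for illustration, and the quartile method is my own choice.

```python
import statistics


def min_max_scale(values):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Made-up fares with one large outlier.
fares = [7.25, 8.05, 13.0, 26.0, 512.33]

for name, scaler in [
    ("min-max", min_max_scale),
    ("standard", standard_scale),
    ("robust", robust_scale),
]:
    print(name, [round(v, 3) for v in scaler(fares)])
```

Notice how the outlier squeezes the min-max values of the ordinary fares toward 0, while the robust version keeps them spread out around the median.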
I will discuss more on that in a different kernel. For now we will use Standard Scaler to feature scale our dataset.
P.S. I am showing a sample of both before and after so that you can see how scaling changes the dataset.
# Feature Scaling
## We will be using standardscaler to transform
from sklearn.preprocessing import StandardScaler

st_scale = StandardScaler()

## transforming "train_x"
X_train = st_scale.fit_transform(X_train)
## transforming "test_x"
X_test = st_scale.transform(X_test)

## transforming "The testset"
#test = st_scale.transform(test)
You can see how the features have transformed above.
Part 7: Modeling the Data
In the previous versions of this kernel, I thought about explaining each model before applying it. However, that makes this kernel too lengthy to sit and read in one go. Therefore, I have decided to break this kernel down, explain each algorithm in a different kernel, and add the links here. If you would like to review logistic regression, please click here.
# import LogisticRegression model in python.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, accuracy_score

## call on the model object
logreg = LogisticRegression(solver='liblinear', penalty='l1', random_state=42)

## fit the model with "train_x" and "train_y"
logreg.fit(X_train, y_train)

## Once the model is trained we want to find out how well the model is performing, so we test the model.
## we use "X_test" portion of the data(this data was not used to fit the model) to predict model outcome.
y_pred = logreg.predict(X_test)

## Once predicted we save that outcome in "y_pred" variable.
## Then we compare the predicted value( "y_pred") and actual value("test_y") to see how well our model is performing.
Evaluating a classification model
There are multiple ways to evaluate a classification model.
Confusion Matrix.
ROC Curve
AUC (area under the ROC curve).
Confusion Matrix
A confusion matrix is a table that describes the performance of a classification model. It tells us how many cases our model predicted correctly and incorrectly for each outcome class by comparing actual and predicted cases. For example, in this dataset our model is a binary one, and we are trying to classify whether a passenger survived or did not survive. We have fit the model using X_train and y_train and stored the predicted outcome of X_test in the variable y_pred. So, now we will use a confusion matrix to compare y_test and y_pred. Let’s build the confusion matrix.
Our y_test has a total of 294 data points; it is the part of the original train set that we split off in order to evaluate our model. Each number here represents certain details about our model. If we think about this in terms of columns and rows, we can see that…
the first column holds the data points that the model predicted as not-survived.
the second column holds the data points that the model predicted as survived.
In terms of rows, the first row, indexed as “Not-survived”, holds the passengers who actually did not survive.
and the row indexed as “Survived” holds the passengers who actually survived.
Now you can see that the predicted not-survived and predicted survived sort of overlap with actual survived and actual not-survived. After all it is a matrix and we have some terminologies to call these statistics more specifically. Let’s see what they are
True Positive(TP): values that the model predicted as yes(survived) and is actually yes(survived).
True Negative(TN): values that model predicted as no(not-survived) and is actually no(not-survived)
False Positive(or Type I error): values that model predicted as yes(survived) but actually no(not-survived)
False Negative(or Type II error): values that model predicted as no(not-survived) but actually yes(survived)
For this dataset, whenever the model predicts yes, it means the model predicts that the passenger survived; whenever the model predicts no, it means the passenger did not survive. Let’s determine the value of each of the terms above.
True Positive(TP):87
True Negative(TN):149
False Positive(FP):28
False Negative(FN):30
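To make the four definitions concrete, here is a minimal sketch of how these counts can be tallied from a list of actual labels and a list of predicted labels. The function name `confusion_counts` and the toy labels are my own assumptions for illustration; they are not the Titanic predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted yes, actually yes
            else:
                fp += 1   # predicted yes, actually no (Type I error)
        else:
            if actual == positive:
                fn += 1   # predicted no, actually yes (Type II error)
            else:
                tn += 1   # predicted no, actually no
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical toy labels (1 = survived, 0 = not survived).
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true_toy, y_pred_toy)
print(counts)
```

The four counts always add up to the number of data points, which is a handy sanity check.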
From these four counts, we can compute many other rates that are used to evaluate a binary classifier.
Accuracy:
Accuracy is the measure of how often the model is correct.
(TP + TN)/total = (87+149)/294 = 0.8027
We can also calculate the accuracy score using scikit-learn.
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
Misclassification Rate: Misclassification Rate is the measure of how often the model is wrong.
Misclassification Rate and Accuracy are complements of each other.
Misclassification Rate is equivalent to 1 minus Accuracy.
Misclassification Rate is also known as “Error Rate”.
(FP + FN)/total = (28+30)/294 = 0.1973
True Positive Rate/Recall/Sensitivity: How often the model predicts yes(survived) when it’s actually yes(survived)?
TP/(TP+FN) = 87/(87+30) = 0.7435897435897436
from sklearn.metrics import recall_score
recall_score(y_test, y_pred)
False Positive Rate: How often the model predicts yes(survived) when it’s actually no(not-survived)?
FP/(FP+TN) = 28/(28+149) = 0.15819209039548024
True Negative Rate/Specificity: How often the model predicts no(not-survived) when it’s actually no(not-survived)?
True Negative Rate is equivalent to 1 minus False Positive Rate.
TN/(TN+FP) = 149/(149+28) = 0.8418079096045198
Precision: How often is the model correct when it predicts yes(survived)?
TP/(TP+FP) = 87/(87+28) = 0.7565217391304347
from sklearn.metrics import precision_score
precision_score(y_test, y_pred)
from sklearn.metrics import classification_report, balanced_accuracy_score
print(classification_report(y_test, y_pred))
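As a cross-check, all of the rates discussed in this section can be recomputed by hand from the four counts reported earlier (TP = 87, TN = 149, FP = 28, FN = 30). The sketch below uses plain Python only. The F1 score is my addition: it is the harmonic mean of precision and recall and is one of the columns a classification report shows.

```python
tp, tn, fp, fn = 87, 149, 28, 30        # counts reported in this section
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
error_rate = (fp + fn) / total           # misclassification rate
recall = tp / (tp + fn)                  # true positive rate / sensitivity
false_positive_rate = fp / (fp + tn)
specificity = tn / (tn + fp)             # true negative rate
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)  # not derived in the text

metrics = {
    "accuracy": accuracy,
    "error rate": error_rate,
    "recall": recall,
    "false positive rate": false_positive_rate,
    "specificity": specificity,
    "precision": precision,
    "f1": f1,
}
for name, value in metrics.items():
    print(f"{name:>20}: {value:.4f}")
```

Note that the error rate equals 1 minus accuracy and that specificity equals 1 minus the false positive rate, exactly as described above.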
We have our confusion matrix. How about we give it a little more character?
import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

np.set_printoptions(precision=2)

class_names = np.array(['not_survived', 'survived'])

# Plot non-normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()
:::
Using Cross-validation:
Pros:
Helps reduce variance.
Expands the model's predictability.
sc = st_scale
## Using StratifiedShuffleSplit
## We can use KFold, StratifiedShuffleSplit, StratifiedKFold or ShuffleSplit. They are all close cousins. Look at the sklearn user guide for more info.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
cv = StratifiedShuffleSplit(n_splits=10, test_size=.25, random_state=0)  # run the model 10x, each time with a 75/25 train/test split
## Using standard scale for the whole dataset.
## saving the feature names for decision tree display
column_names = X.columns
X = sc.fit_transform(X)
accuracies = cross_val_score(LogisticRegression(solver='liblinear'), X, y, cv=cv)
print("Cross-Validation accuracy scores:{}".format(accuracies))
print("Mean Cross-Validation accuracy score: {}".format(round(accuracies.mean(), 5)))
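To see what a stratified shuffle split does under the hood, here is a simplified sketch in plain Python. It is not scikit-learn's implementation: the function `stratified_shuffle_split` and the toy labels are assumptions of mine, and the rounding of the per-class test size is a deliberate simplification. The idea it illustrates is that every split is reshuffled, and each class keeps roughly the same share in train and test.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(labels, n_splits=10, test_size=0.25, seed=0):
    """Yield (train_idx, test_idx) pairs that roughly preserve class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for indices in by_class.values():
            shuffled = indices[:]          # copy so the original order is kept
            rng.shuffle(shuffled)
            n_test = round(len(shuffled) * test_size)
            test_idx.extend(shuffled[:n_test])
            train_idx.extend(shuffled[n_test:])
        yield sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 12 "not survived" (0) and 8 "survived" (1).
toy_labels = [0] * 12 + [1] * 8
splits = list(stratified_shuffle_split(toy_labels, n_splits=3))
for train_idx, test_idx in splits:
    survived_in_test = sum(toy_labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), survived_in_test)
```

With a 25% test size, each split holds out 3 of the 12 zeros and 2 of the 8 ones, so the 60/40 class ratio is the same in both parts.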
:::
Grid Search on Logistic Regression
What is grid search?
What are the pros and cons?
Grid search is a simple concept but an effective technique in machine learning. The name comes from the fact that we search for the optimal parameter or parameters over a “grid” of candidate values. These parameters are known as hyperparameters: model settings that are fixed before fitting and that determine the behavior of the model. For example, when we choose to use linear regression, we may decide to add a penalty to the loss function, such as Ridge or Lasso. These penalties require a specific alpha (the strength of the regularization) to be set beforehand. The higher the value of alpha, the larger the penalty. Grid search finds the best value of alpha among a range of values that we provide, and then we go on and use that value to fit the model and get sweet results. It is essential to understand that hyperparameters are different from model outcomes: coefficients, or evaluation metrics such as accuracy score or mean squared error, are model outcomes, not hyperparameters.
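The idea of searching over a grid can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how GridSearchCV is implemented: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for "fit the model with these hyperparameters and return its cross-validated score".

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Try every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(alpha, penalty):
    """Hypothetical validation score that peaks at alpha = 1 with 'l2'."""
    bonus = 0.02 if penalty == "l2" else 0.0
    return 0.8 - 0.01 * (alpha - 1.0) ** 2 + bonus


toy_grid = {"alpha": [0.1, 0.5, 1.0, 5.0, 10.0], "penalty": ["l1", "l2"]}
best_params, best_score = grid_search(toy_score, toy_grid)
print(best_params, round(best_score, 3))
```

The cost grows with the product of the list lengths (here 5 x 2 = 10 fits), which is why large grids get slow quickly.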
This part of the kernel is a work in progress. Please check back again for future updates.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
## C_vals holds candidate values of C, the inverse of the regularization strength used by lasso and ridge.
## As C decreases (that is, as alpha increases), the penalty gets stronger and the model complexity decreases.
## Remember that valid values are 0 < C < infinity.
C_vals = [0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,16.5,17,17.5,18]
## Choosing penalties (Lasso(l1) or Ridge(l2))
penalties = ['l1', 'l2']
## Choose a cross validation strategy.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.25)
## setting param for param_grid in GridSearchCV.
param = {'penalty': penalties, 'C': C_vals}
## the liblinear solver supports both the l1 and the l2 penalty
logreg = LogisticRegression(solver='liblinear')
## Calling on GridSearchCV object.
grid = GridSearchCV(estimator=logreg,
                    param_grid=param,
                    scoring='accuracy',
                    n_jobs=-1,
                    cv=cv)
## Fitting the model
grid.fit(X, y)
## Getting the best of everything.
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
### Using the best parameters from the grid-search.
logreg_grid = grid.best_estimator_
logreg_grid.score(X, y)
:::
This part of the kernel is a work in progress. Please check back again for future updates.
So, we have our first model and its score. But how do we make sure that our model is performing well? Our model may be overfitting or underfitting. For those of you who don’t know what overfitting and underfitting are, let’s find out.
As you see in the chart above, underfitting is when the model fails to capture important aspects of the data and therefore introduces more bias and performs poorly. Overfitting, on the other hand, is when the model performs very well on the training data but poorly on the validation or test sets. This situation is known as having low bias but high variance, and it also leads to poor performance. Ideally, we want a model that performs well not only on the training data but also on the test data. This is where the bias-variance tradeoff comes in. When we have a model that overfits, meaning low bias and high variance, we introduce some bias in exchange for much less variance. One particular tactic for this task is regularization (Ridge, Lasso, Elastic Net). These models are built to deal with the bias-variance tradeoff. This kernel explains this topic well. Also, the following chart gives us a mental picture of where we want our models to be.
Ideally, we want to pick a sweet spot where the model performs well on the training set, validation set, and test set. As the model gets more complex, bias decreases and variance increases. However, the most critical part is the error rate. We want our models to be at the bottom of that U shape, where the error rate is lowest. That sweet spot is also known as Optimum Model Complexity (OMC).
Now that we know what we want in terms of underfitting and overfitting, let’s talk about how to combat them.
How to combat over-fitting?
Simplify the model by using fewer parameters.
Simplify the model by changing the hyperparameters.
Introduce regularization.
Use more training data.
Gather more data (and gather better quality data).
This part of the kernel is a work in progress. Please check back again for future updates.
## Importing the model.
from sklearn.neighbors import KNeighborsClassifier
## calling on the model object.
knn = KNeighborsClassifier(metric='minkowski', p=2)
## with metric='minkowski' and p=2, the knn classifier uses Euclidean distance
## doing 10-fold stratified-shuffle-split cross validation
cv = StratifiedShuffleSplit(n_splits=10, test_size=.25, random_state=2)
accuracies = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')
print("Cross-Validation accuracy scores:{}".format(accuracies))
print("Mean Cross-Validation accuracy score: {}".format(round(accuracies.mean(), 3)))
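Since the comments above only mention Euclidean distance in passing, here is a minimal sketch of what a k-nearest-neighbors prediction does: measure the distance from the query point to every training point, keep the k closest, and let them vote. The function `knn_predict` and the two-feature toy data are assumptions for illustration, not part of the Titanic pipeline.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    distances = [
        (math.dist(point, query), label)   # Euclidean distance (Python 3.8+)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data with two well separated groups.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (2.0, 2.0)))
print(knn_predict(train_points, train_labels, (6.0, 6.5)))
```

Because the vote depends on raw distances, features on very different scales can dominate the result, which is one reason the data is standardized before fitting.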
## Search for an optimal value of k for KNN.
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')
    k_scores.append(scores.mean())
print("Accuracy scores are: {}\n".format(k_scores))
print("Mean accuracy score: {}".format(np.mean(k_scores)))
from matplotlib import pyplot as plt
plt.plot(k_range, k_scores)
from sklearn.model_selection import GridSearchCV
## trying out multiple values for k
k_range = range(1, 31)
weights_options = ['uniform', 'distance']
param = {'n_neighbors': k_range, 'weights': weights_options}
## Using StratifiedShuffleSplit.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
# estimator = knn, param_grid = param, n_jobs = -1 to instruct scikit-learn to use all available processors.
grid = GridSearchCV(KNeighborsClassifier(), param, cv=cv, verbose=False, n_jobs=-1)
## Fitting the model.
grid.fit(X, y)
### Using the best parameters from the grid-search.
knn_grid = grid.best_estimator_
knn_grid.score(X, y)
:::
Using RandomizedSearchCV
Randomized search is a close cousin of grid search. It doesn’t always find the best result, but it’s fast.
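The difference from grid search is easy to see in a small sketch: instead of trying every combination, we only score a fixed number of randomly drawn ones. The names `random_search` and `toy_knn_score` are hypothetical, the score function is a made-up stand-in for a cross-validated accuracy, and this sketch draws combinations with replacement for simplicity.

```python
import random


def random_search(score_fn, param_grid, n_iter=10, seed=0):
    """Score n_iter randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_knn_score(n_neighbors, weights):
    """Hypothetical validation score that peaks at n_neighbors = 7."""
    bonus = 0.01 if weights == "distance" else 0.0
    return 0.8 - 0.005 * abs(n_neighbors - 7) + bonus


toy_grid = {"n_neighbors": list(range(1, 31)), "weights": ["uniform", "distance"]}
best_params, best_score = random_search(toy_knn_score, toy_grid, n_iter=10)
print(best_params, round(best_score, 3))
```

Only 10 of the 60 possible combinations are scored here, so the search is cheaper, but the true optimum may be missed.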
from sklearn.model_selection import RandomizedSearchCV
## trying out multiple values for k
k_range = range(1, 31)
weights_options = ['uniform', 'distance']
param = {'n_neighbors': k_range, 'weights': weights_options}
## Using StratifiedShuffleSplit.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30)
# estimator = knn, param_grid = param, n_jobs = -1 to instruct scikit-learn to use all available processors.
## for RandomizedSearchCV, n_iter is the number of parameter settings that are sampled.
grid = RandomizedSearchCV(KNeighborsClassifier(), param, cv=cv, verbose=False, n_jobs=-1, n_iter=40)
## Fitting the model.
grid.fit(X, y)
### Using the best parameters from the randomized search.
knn_ran_grid = grid.best_estimator_
knn_ran_grid.score(X, y)
from sklearn.svm import SVC
Cs = [0.001, 0.01, 0.1, 1, 1.5, 2, 2.5, 3, 4, 5, 10]  ## penalty parameter C for the error term.
gammas = [0.0001, 0.001, 0.01, 0.1, 1]
param_grid = {'C': Cs, 'gamma': gammas}
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
grid_search = GridSearchCV(SVC(kernel='rbf', probability=True), param_grid, cv=cv)  ## 'rbf' stands for gaussian kernel
grid_search.fit(X, y)
# using the best found hyperparameters to get the score.
svm_grid = grid_search.best_estimator_
svm_grid.score(X, y)
:::
Decision Tree Classifier
A decision tree works by breaking the dataset down into smaller and smaller subsets. This is done by asking questions about the features of the dataset. The idea is to unmix the labels while asking as few questions as necessary. As we ask questions, we break the dataset into more subsets. Once we have a subgroup that contains only one type of label, we end the tree at that node. If you would like a detailed understanding of the decision tree classifier, please take a look at this kernel.
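One common way to measure how well a question “unmixes” the labels is Gini impurity, which is 0 when a group contains a single class. The sketch below assumes that measure and uses made-up rows; `gini`, `split_impurity` and the toy data are illustrative names, not the tree fitted in this kernel.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 means the group contains a single class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, labels, feature_index, threshold):
    """Weighted Gini impurity after asking 'is this feature <= threshold?'."""
    left = [lab for row, lab in zip(rows, labels) if row[feature_index] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[feature_index] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical toy rows: (is_female, age); label 1 = survived.
rows = [(1, 22), (1, 38), (0, 35), (0, 54), (1, 4), (0, 20)]
labels = [1, 1, 0, 0, 1, 0]

print(gini(labels))                          # impurity before any question
print(split_impurity(rows, labels, 0, 0.5))  # question about is_female
print(split_impurity(rows, labels, 1, 30))   # question about age <= 30
```

In this toy data the question about `is_female` separates the labels perfectly, so a tree would ask it first and stop, while the age question leaves both subgroups mixed.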
dectree_grid = grid.best_estimator_
## using the best found hyperparameters to get the score.
dectree_grid.score(X, y)
:::
Let’s look at the feature importance from decision tree grid.
## feature importance
feature_importances = pd.DataFrame(dectree_grid.feature_importances_,
                                   index=column_names,
                                   columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head(10)
These are the top 10 features that, according to the Decision Tree, helped classify the fates of the passengers on the Titanic that night.
7f. Random Forest Classifier
I admire working with decision trees because of the potential and the basics they provide for building a more complex model like Random Forest (RF). RF is an ensemble method (a combination of many decision trees), which is where the “forest” part comes in. One crucial detail about Random Forest is that, while building a forest of decision trees, the RF model takes random subsets of the original dataset (bootstrapped) and random subsets of the variables (features/columns). Using this method, the RF model creates hundreds to thousands (the number can be set manually) of varied decision trees. This variety makes the RF model more effective and accurate. We then run each test data point through all of these decision trees and take a vote on the output.
from sklearn.metrics import classification_report
# Print classification report for y_test
print(classification_report(y_test, y_pred, labels=rf_grid.classes_))
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
There are two types of ensemble learning.
Bagging/Averaging Methods
In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.
Boosting Methods
The other family of ensemble methods are boosting methods, where base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
Bagging Classifier(Bootstrap Aggregating) is the ensemble method that involves manipulating the training set by resampling and running algorithms on it. Let’s do a quick review:
A bagging classifier uses a process called bootstrapping to create multiple datasets from one original dataset and runs the algorithm on each one of them. Here is an image to show how a bootstrapped dataset works.
Resampling from original dataset to bootstrapped datasets
Source: https://uc-r.github.io
After running a learning algorithm on each of the bootstrapped datasets, all the models are combined by averaging their predictions. The test data/new data then go through this averaged/combined classifier, which predicts the output.
Here is an image to make it clear on how bagging works,
Source: https://prachimjoshi.files.wordpress.com
Please check out this kernel if you want to find out more about bagging classifier.
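The two steps described above, resampling with replacement and then combining the fitted models, can be sketched in plain Python. Everything named here is an assumption for illustration: the base learner `fit_threshold_model` is a deliberately tiny stand-in for a decision tree, the data is a made-up single feature, and the models are combined by majority vote, which is the classification counterpart of averaging.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(xs, labels, rng):
    """Draw len(xs) examples with replacement from the original data."""
    picks = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in picks], [labels[i] for i in picks]


def fit_threshold_model(xs, labels):
    """Hypothetical base learner: split halfway between the two class means."""
    zeros = [x for x, lab in zip(xs, labels) if lab == 0]
    ones = [x for x, lab in zip(xs, labels) if lab == 1]
    if not zeros or not ones:
        only_label = labels[0]            # the sample contains a single class
        return lambda x: only_label
    threshold = (mean(zeros) + mean(ones)) / 2
    high_label = 1 if mean(ones) > mean(zeros) else 0
    return lambda x: high_label if x > threshold else 1 - high_label


def bagging_predict(models, x):
    """Combine the base learners by majority vote."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


rng = random.Random(0)
xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]   # hypothetical single feature
labels = [0, 0, 0, 0, 1, 1, 1, 1]                # hypothetical labels
models = [fit_threshold_model(*bootstrap_sample(xs, labels, rng)) for _ in range(25)]
print(bagging_predict(models, 1.5), bagging_predict(models, 8.5))
```

Each base learner sees a slightly different dataset and so learns a slightly different threshold; the vote smooths out those differences.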
from sklearn.ensemble import BaggingClassifier
n_estimators = [10,30,50,70,80,150,160,170,175,180,185]
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
parameters = {'n_estimators': n_estimators}
## No base estimator is passed, so the default (a decision tree) is used.
grid = GridSearchCV(BaggingClassifier(bootstrap_features=False),
                    param_grid=parameters,
                    cv=cv,
                    n_jobs=-1)
grid.fit(X, y)
Why use Bagging? (Pros and cons)
Bagging works best with strong and complex models (for example, fully developed decision trees). However, don’t let that fool you into thinking that, like a single decision tree, bagging also overfits. Instead, bagging reduces overfitting, since the training data are resampled with repetition to create many base estimators. With a lot of equally likely training samples, bagging is not very susceptible to overfitting on noisy data and therefore reduces variance. However, the downside is that this leads to an increase in bias.
Random Forest VS. Bagging Classifier
If some of you are like me, you may find Random Forest to be similar to Bagging Classifier. However, there is a fundamental difference between the two: Random Forest’s ability to pick a subset of features at each node. I will elaborate on this in a future update.
7h. AdaBoost Classifier
AdaBoost is another ensemble model and is quite different from Bagging. Let’s point out the core concepts.
AdaBoost combines a lot of “weak learners” (also called stumps: trees with only one node and two leaves) to make classifications.
Fitting the base models is an iterative process where each stump is chained after the previous one; it cannot run in parallel.
Some stumps get more say in the final classification than others. The model uses weights that are assigned to each data point/row to indicate its “importance.” Samples with higher weight have a higher influence on the total error of the next model and get more priority. The first stump starts with uniformly distributed weights, which means that, in the beginning, every data point has an equal weight.
Each stump is made by taking the previous stump’s mistakes into account. After each iteration the weights get recalculated in order to take the errors/misclassifications from the last stump into consideration.
The final prediction is typically constructed by a weighted vote, where the weight for each base model depends on its training error or misclassification rate.
To illustrate what we have talked about so far let’s look at the following visualization.
Source: Diogo(Medium)
Let’s dive into each one of the nitty-gritty stuff about AdaBoost:
First, we determine the best feature to split the dataset on using the Gini index (basics from decision trees). The feature with the lowest Gini index becomes the first stump in the AdaBoost stump chain (the lower the Gini index, the better unmixed the labels are, and therefore the better the split).
Secondly, we need to determine how much say a stump will have in the final classification and how we can calculate that.
We learn how much say a stump has in the final classification by calculating how well it classified the samples (aka calculating the total error of the stump).
The Total Error for a stump is the sum of the weights associated with the incorrectly classified samples. For example, let’s say we start a stump with 10 data points. The first stump will distribute the weight uniformly among all the data points, which means each data point will have a weight of 1/10. Let’s say that once the weights are distributed we run the model and find 2 incorrect predictions. In order to calculate the total error we add up the weights of the misclassified samples. Here we get 1/10 + 1/10 = 2/10 or 1/5. This is our total error. We can also think about it
Since the weights add up to 1 across all data points, the total error will always be between 0 (perfect stump) and 1 (horrible stump).
We use the total error to determine the amount of say a stump has in the final classification using the following formula:
\[ \alpha_t = \frac{1}{2} \ln\left(\frac{1 - TotalError_t}{TotalError_t}\right) \]
We can draw a graph to determine the amount of say using the value of total error(0 to 1)
Source: Chris McCormick
The blue line tells us the amount of say for Total Error(Error rate) between 0 and 1.
When the stump does a reasonably good job, and the total error is minimal, then the amount of say(Alpha) is relatively large, and the alpha value is positive.
When the stump does an average job(similar to a coin flip/the ratio of getting correct and incorrect ~50%/50%), then the total error is ~0.5. In this case the amount of say is 0.
When the error rate is high, let’s say close to 1, then the amount of say will be negative, which means that if the stump outputs “survived,” its negative weight turns that vote into “not survived.”
P.S. If the Total Error is 1 or 0, then this equation will freak out. A small amount of error is added to prevent this from happening.
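To check the three cases above numerically, here is a small sketch that assumes the standard AdaBoost expression for the amount of say, alpha = 0.5 * ln((1 - error) / error). The function name `amount_of_say` and the size of the small guard value `eps` are my own choices.

```python
import math


def amount_of_say(total_error, eps=1e-10):
    """alpha = 0.5 * ln((1 - error) / error); eps guards against errors of 0 and 1."""
    error = min(max(total_error, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - error) / error)


for err in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(err, round(amount_of_say(err), 3))
```

A total error of 0.2 gives a positive amount of say, 0.5 gives exactly 0, and 0.8 gives the same magnitude as 0.2 with a negative sign. Without the guard, errors of exactly 0 or 1 would cause a division by zero or a logarithm of zero.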
Third, we need to learn how to modify the weights so that the next stump will take the errors that the current stump made into account. The pseudocode for calculating the new sample weight is as follows.
\[ New Sample Weight = Sample Weight \times e^{\pm\alpha_t}\]
Here the exponent is \(+\alpha_t\) (the amount of say) when the sample was misclassified by the current stump and \(-\alpha_t\) when it was correctly classified. We want to increase the sample weight of the misclassified samples, hinting the next stump to put more emphasis on those. Inversely, we want to decrease the sample weight of the correctly classified samples, hinting the next stump to put less emphasis on those. The new weights are then rescaled so that they again add up to 1.
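Here is a worked version of the 10-point example used earlier, where the first stump misclassifies 2 of 10 equally weighted samples. It is a sketch under the usual AdaBoost assumptions: misclassified weights are multiplied by e^{+alpha}, correct ones by e^{-alpha}, and the result is renormalized to sum to 1. The variable names are mine.

```python
import math

n_samples = 10
weights = [1.0 / n_samples] * n_samples           # uniform starting weights
misclassified = [False] * 8 + [True] * 2          # the stump got 2 of 10 wrong

# Total error: sum of the weights of the misclassified samples (here 0.2).
total_error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
alpha = 0.5 * math.log((1.0 - total_error) / total_error)

# Scale misclassified samples up and correctly classified samples down.
scaled = [
    w * math.exp(alpha if wrong else -alpha)
    for w, wrong in zip(weights, misclassified)
]

# Renormalize so the new weights add up to 1 again.
norm = sum(scaled)
new_weights = [w / norm for w in scaled]

print(round(new_weights[0], 4), round(new_weights[-1], 4))
```

Each correctly classified sample drops from 0.1 to 0.0625 and each misclassified one rises from 0.1 to 0.25, so the two mistakes together now carry half of the total weight.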
The following equation helps us combine the stumps into the final classification.
\[ AdaBoost(X) = sign\left(\sum_{t \in T} \alpha_t h_t(X)\right) \]
\(AdaBoost(X)\) is the classification predictions for \(y\) using predictor matrix \(X\)
\(T\) is the set of “weak learners”
\(\alpha_t\) is the contribution weight for weak learner \(t\)
\(h_t(X)\) is the prediction of weak learner \(t\)
and \(y\) is binary with values -1 and 1
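The definitions above describe a weighted vote: each weak learner predicts -1 or +1, its prediction is multiplied by its contribution weight, and the sign of the sum is the final answer. A minimal sketch follows, with three made-up stumps and made-up alpha values; none of them come from the fitted model.

```python
def adaboost_predict(stumps, alphas, x):
    """Return the sign of sum_t alpha_t * h_t(x), with labels in {-1, +1}."""
    total = sum(alpha * stump(x) for stump, alpha in zip(stumps, alphas))
    return 1 if total >= 0 else -1


# Hypothetical stumps over a passenger tuple (is_female, age, pclass).
stumps = [
    lambda x: 1 if x[0] == 1 else -1,   # vote "survived" for women
    lambda x: 1 if x[1] < 10 else -1,   # vote "survived" for young children
    lambda x: 1 if x[2] == 1 else -1,   # vote "survived" for first class
]
alphas = [0.9, 0.3, 0.4]                # hypothetical amounts of say

print(adaboost_predict(stumps, alphas, (1, 30, 3)))
print(adaboost_predict(stumps, alphas, (0, 5, 1)))
```

In the first call the single stump with the most say outvotes the other two (0.9 against 0.7), and in the second call it does so in the opposite direction, which shows why the vote is weighted rather than a simple majority.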
P.S. Since a single stump barely captures the essential aspects of the dataset, the model is highly biased in the beginning. However, as the chain of stumps grows, AdaBoost becomes a strong learner by the end of the process and reduces both bias and variance.
from sklearn.ensemble import AdaBoostClassifier
n_estimators = [100,140,145,150,160,170,175,180,185]
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
learning_r = [0.1,1,0.01,0.5]
parameters = {'n_estimators': n_estimators,
              'learning_rate': learning_r}
## No base estimator is passed, so the default (a decision tree stump) is used.
grid = GridSearchCV(AdaBoostClassifier(),
                    param_grid=parameters,
                    cv=cv,
                    n_jobs=-1)
grid.fit(X, y)
Pros and cons of boosting
Pros
Achieves higher performance than bagging when hyper-parameters are tuned properly.
Can be used for classification and regression equally well.
Easily handles mixed data types.
Can use “robust” loss functions that make the model resistant to outliers.
Cons
Difficult and time consuming to properly tune hyper-parameters.
Cannot be parallelized like bagging (scales poorly to huge amounts of data).
from sklearn.ensemble import ExtraTreesClassifier
## use a separate name for the fitted model so the class itself is not overwritten
extra_trees = ExtraTreesClassifier()
extra_trees.fit(X, y)
y_pred = extra_trees.predict(X_test)
extraTree_accy = round(accuracy_score(y_pred, y_test), 3)
print(extraTree_accy)
from sklearn.gaussian_process import GaussianProcessClassifier
## use a separate name for the fitted model so the class itself is not overwritten
gau_pro = GaussianProcessClassifier()
gau_pro.fit(X, y)
y_pred = gau_pro.predict(X_test)
gau_pro_accy = round(accuracy_score(y_pred, y_test), 3)
print(gau_pro_accy)
all_models = [logreg_grid,
              knn_grid,
              knn_ran_grid,
              svm_grid,
              dectree_grid,
              rf_grid,
              bagging_grid,
              adaBoost_grid,
              voting_classifier]
c = {}
for i in all_models:
    a = i.predict(X_test)
    b = accuracy_score(a, y_test)
    c[i] = b
UndefVarError: `logreg_grid` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
[1] top-level scope
@C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:2954
UndefVarError: `c` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
[1] top-level scope
@C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:2976
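The comparison loop above stores each accuracy in a dictionary keyed by the model object itself. A minimal sketch of the same idea, keyed by readable names and sorted from best to worst, might look like the following. It is plain Python with no scikit-learn dependency. The names `accuracy`, `toy_models`, `X_toy` and `y_toy` are hypothetical stand-ins, and the data is made up purely so the sketch runs; it says nothing about the real Titanic scores.

```python
from typing import Callable, Dict, List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy stand-ins for fitted models (hypothetical): each maps a feature row to a label.
toy_models: Dict[str, Callable[[List[int]], int]] = {
    "always_zero": lambda row: 0,
    "first_feature": lambda row: row[0],
}

# Made-up test split, only to make the sketch runnable.
X_toy = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_toy = [1, 0, 1, 1]

# Score every model on the same held-out rows.
scores = {
    name: round(accuracy(y_toy, [model(row) for row in X_toy]), 3)
    for name, model in toy_models.items()
}

# Rank models from best to worst accuracy.
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in leaderboard:
    print(f"{name}: {score}")
```

Keying by name rather than by estimator object keeps the printed summary readable and makes the result easy to turn into a table or plot later.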
:::
<h1>Resources</h1>
<ul>
<li><b>Statistics</b></li>
<ul>
<li><a href="https://statistics.laerd.com/statistical-guides/measures-of-spread-standard-deviation.php">Types of Standard Deviation</a></li>
<li><a href="https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-is-a-t-test-and-why-is-it-like-telling-a-kid-to-clean-up-that-mess-in-the-kitchen">What Is a t-test? And Why Is It Like Telling a Kid to Clean Up that Mess in the Kitchen?</a></li>
<li><a href="https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics">What Are T Values and P Values in Statistics?</a></li>
<li><a href="https://www.youtube.com/watch?v=E4KCfcVwzyw">What is p-value? How we decide on our confidence level.</a></li>
</ul>
<li><b>Writing pythonic code</b></li>
<ul>
<li><a href="https://www.kaggle.com/rtatman/six-steps-to-more-professional-data-science-code">Six steps to more professional data science code</a></li>
<li><a href="https://www.kaggle.com/jpmiller/creating-a-good-analytics-report">Creating a Good Analytics Report</a></li>
<li><a href="https://en.wikipedia.org/wiki/Code_smell">Code Smell</a></li>
<li><a href="https://www.python.org/dev/peps/pep-0008/">Python style guides</a></li>
<li><a href="https://gist.github.com/sloria/7001839">The Best of the Best Practices(BOBP) Guide for Python</a></li>
<li><a href="https://www.python.org/dev/peps/pep-0020/">PEP 20 -- The Zen of Python</a></li>
<li><a href="https://docs.python-guide.org/">The Hitchhiker's Guide to Python</a></li>
<li><a href="https://realpython.com/tutorials/best-practices/">Python Best Practice Patterns</a></li>
<li><a href="http://www.nilunder.com/blog/2013/08/03/pythonic-sensibilities/">Pythonic Sensibilities</a></li>
</ul>
<li><b>Why Scikit-Learn?</b></li>
<ul>
<li><a href="https://www.oreilly.com/content/intro-to-scikit-learn/">Introduction to Scikit-Learn</a></li>
<li><a href="https://www.oreilly.com/content/six-reasons-why-i-recommend-scikit-learn/">Six reasons why I recommend scikit-learn</a></li>
<li><a href="https://hub.packtpub.com/learn-scikit-learn/">Why you should learn Scikit-learn</a></li>
<li><a href="https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines">A Deep Dive Into Sklearn Pipelines</a></li>
<li><a href="https://www.kaggle.com/sermakarevich/sklearn-pipelines-tutorial">Sklearn pipelines tutorial</a></li>
<li><a href="https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html">Managing Machine Learning workflows with Sklearn pipelines</a></li>
<li><a href="https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976">A simple example of pipeline in Machine Learning using SKlearn</a></li>
</ul>
</ul>
<h1>Credits</h1>
<ul>
<li>To Brandon Foltz for his <a href="https://www.youtube.com/channel/UCFrjdcImgcQVyFbK04MBEhA">YouTube</a> channel and for being an amazing teacher.</li>
<li>To GA where I started my data science journey.</li>
<li>To the Kaggle community for inspiring me over and over again with all the resources I need.</li>
<li>To the Udemy course "Deployment of Machine Learning". I have used and modified some of the code from this course to help make the learning process intuitive.</li>
</ul>
If you would like to discuss any other projects or just have a chat about data science topics, I’ll be more than happy to connect with you on:
This kernel will always be a work in progress. I will incorporate new data science concepts with each update as I learn them. If you have any ideas or suggestions about this notebook, please let me know. Any feedback on further improvements would be genuinely appreciated.
If you have come this far, congratulations!!
If this notebook helped you in any way or you liked it, please upvote and/or leave a comment!! :)
Source Code
---title: "Análisis Estadístico y ML del Titanic"author: "Fabrizio Torrico"date: "05/07/2025"format: html: code-fold: false code-tools: true code-copy: true toc: true toc-depth: 3 fig-width: 10 fig-height: 6 embed-resources: true error: true warning: true message: true html-math-method: mathjax include-in-header: | <script src="https://public.tableau.com/javascripts/api/viz_v1.js"></script>engine: julia---<img src="http://data.freehdw.com/ships-titanic-vehicles-best.jpg" Width="800"><a id="introduction"></a><br>This kernel is for all aspiring data scientists to learn from and to review their knowledge. We will have a detailed statistical analysis of Titanic data set along with Machine learning model implementation. I am super excited to share my first kernel with the Kaggle community. As I go on in this journey and learn new topics, I will incorporate them with each new updates. So, check for them and please <b>leave a comment</b> if you have any suggestions to make this kernel better!! Going back to the topics of this kernel, I will do more in-depth visualizations to explain the data, and the machine learning classifiers will be used to predict passenger survival status.NOTE:- This is a julia translation- If you are reading this on github, I recommend you read this on <a href="https://www.kaggle.com/masumrumi/a-statistical-analysis-ml-workflow-of-titanic">kaggle</a>- Follow me on github: # Kernel Goals<a id="aboutthiskernel"></a>---There are three primary goals of this kernel.- <b>Do a statistical analysis</b> of how some group of people was survived more than others.- <b>Do an exploratory data analysis(EDA)</b> of titanic with visualizations and storytelling.- <b>Predict</b>: Use machine learning classification models to predict the chances of passengers survival.P.S. 
If you want to learn more about regression models, try this [kernel](https://www.kaggle.com/masumrumi/a-stats-analysis-and-ml-workflow-of-house-pricing/edit/run/9585160).# Part 1: Importing Necessary Libraries and datasets---<a id="import_libraries**"></a>## 1a. Loading librariesPython is a fantastic language with a vibrant community that produces many amazing libraries. I am not a big fan of importing everything at once for the newcomers. So, I am going to introduce a few necessary libraries for now, and as we go on, we will keep unboxing new libraries when it seems appropriate.```{julia}usingPkgPkg.activate(".")Pkg.add(["IJulia", "DataFrames", "CSV", "CairoMakie", "StatsBase","Statistics", "MLJ", "MLJModels", "HypothesisTests","Distributions", "Missings", "CategoricalArrays", "AlgebraOfGraphics", "Chain"])``````{julia}#| _cell_guid: 80643cb5-64f3-4180-92a9-2f8e83263ac6#| _kg_hide-input: true#| _uuid: 33d54abf387474bce3017f1fc3832493355010c0#| tags: []importDataFrames as DFimportCSVimportCairoMakie as MakieimportAlgebraOfGraphics as AoGimportStatistics as StatsimportStatsBaseimportChain: @chainimportRandom: shuffleimportIJulia``````{julia}readdir("./input/")```## 1b. Loading Datasets<a id="load_data"></a>---After loading the necessary modules, we need to import the datasets. Many of the business problems usually come with a tremendous amount of messy data. We extract those data from many sources. I am hoping to write about that in a different kernel. For now, we are going to work with a less complicated and quite popular machine learning dataset.```{julia}## Importing the datasetsusingCSVtrain = CSV.read("./input/train.csv", DF.DataFrame)test = CSV.read("./input/test.csv", DF.DataFrame);```You are probably wondering why two datasets? Also, Why have I named it "train" and "test"? To explain that I am going to give you an overall picture of the supervised machine learning process."Machine Learning" is simply "Machine" and "Learning". Nothing more and nothing less. 
In a supervised machine learning process, we are giving machine/computer/models specific inputs or data(text/number/image/audio) to learn from aka we are training the machine to learn certain aspects based on the data and the output. Now, how can we determine that machine is actually learning what we are try to teach? That is where the test set comes to play. We withhold part of the data where we know the output/result of each datapoints, and we use this data to test the trained models. We then compare the outcomes to determine the performance of the algorithms. If you are a bit confused thats okay. I will explain more as we keep reading. Let's take a look at sample datasets.```{julia}DF.first(train, 5)print(train.Pclass)``````{julia}@chain train begin DF.dropmissing(:Age) # Drop rows with missing Age DF.groupby(:Sex) DF.combine(:Age => minimum =>:MinAge)end``````{julia}DF.describe(train, :eltype)```## 1c. A Glimpse of the Datasets.<a id="glimpse"></a>---# Train Set```{julia}DF.first(train[shuffle(1:DF.nrow(train))[1:5], :], 5)```# Test Set```{julia}DF.first(test[shuffle(1:DF.nrow(test))[1:5], :], 5)```This is a sample of train and test dataset. Lets find out a bit more about the train and test dataset.```{julia}println("The shape of the train data is (row, column): $(size(train))")println("Train dataset info:")DF.describe(train)println("The shape of the test data is (row, column): $(size(test))")println("Test dataset info:")DF.describe(test)```## 1d. About This Dataset<a id="aboutthisdataset"></a>---The data has split into two groups:- training set (train.csv)- test set (test.csv)**_The training set includes our target variable(dependent variable), passenger survival status_** (also known as the ground truth from the Titanic tragedy) along with other independent features like gender, class, fare, and Pclass.The test set should be used to see how well our model performs on unseen data. 
When we say unseen data, we mean that the algorithm or machine learning models have no relation to the test data. We do not want to use any part of the test data in any way to modify our algorithms; Which are the reasons why we clean our test data and train data separately. **_The test set does not provide passengers survival status_**. We are going to use our model to predict passenger survival status.Now let's go through the features and describe a little. There is a couple of different type of variables, They are...---**Categorical:**- **Nominal**(variables that have two or more categories, but which do not have an intrinsic order.) > - **Cabin** > - **Embarked**(Port of Embarkation) C(Cherbourg) Q(Queenstown) S(Southampton)- **Dichotomous**(Nominal variable with only two categories) > - **Sex** Female Male- **Ordinal**(variables that have two or more categories just like nominal variables. Only the categories can also be ordered or ranked.) > - **Pclass** (A proxy for socio-economic status (SES)) 1(Upper) 2(Middle) 3(Lower)---**Numeric:**- **Discrete** > - **Passenger ID**(Unique identifing # for each passenger) > - **SibSp** > - **Parch** > - **Survived** (Our outcome or dependent variable) 0 1- **Continous** > - **Age** > - **Fare**---**Text Variable**> - **Ticket** (Ticket number for passenger.)> - **Name**( Name of the passenger.)## 1e. Tableau Visualization of the Data<a id='tableau_visualization'></a>---I have incorporated a tableau visualization below of the training data. 
This visualization...- is for us to have an overview and play around with the dataset.- is done without making any changes(including Null values) to any features of the dataset.---Let's get a better perspective of the dataset through this visualization.```{=html}<div class='tableauPlaceholder' id='viz1516349898238' style='position: relative'><noscript><a href='#'><img alt='An Overview of Titanic Training Dataset ' src='https://public.tableau.com/static/images/Ti/Titanic_data_mining/Dashboard1/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Titanic_data_mining/Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Ti/Titanic_data_mining/Dashboard1/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1516349898238'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>```We want to see how the left vertical bar changes when we filter out unique values of certain features. We can use multiple filters to see if there are any correlations among them. 
For example, if we click on **upper** and **Female** tab, we would see that green color dominates the bar with a ratio of 91:3 survived and non survived female passengers; a 97% survival rate for females. We can reset the filters by clicking anywhere in the whilte space. The age distribution chart on top provides us with some more info such as, what was the age range of those three unlucky females as the red color give away the unsurvived once. If you would like to check out some of my other tableau charts, please click [here.](https://public.tableau.com/profile/masum.rumi#!/)# Part 2: Overview and Cleaning the Data<a id="cleaningthedata"></a>---## 2a. OverviewDatasets in the real world are often messy, However, this dataset is almost clean. Lets analyze and see what we have here.```{julia}#| _cell_guid: bf19c831-fbe0-49b6-8bf8-d7db118f40b1#| _kg_hide-input: true#| _uuid: 5a0593fb4564f0284ca7fdf5c006020cb288db95#| execution: {iopub.execute_input: '2021-06-26T16:35:08.956119Z', iopub.status.busy: '2021-06-26T16:35:08.955538Z', iopub.status.idle: '2021-06-26T16:35:08.973222Z', shell.execute_reply: '2021-06-26T16:35:08.972151Z', shell.execute_reply.started: '2021-06-26T16:35:08.956072Z'}DF.describe(train, :nmissing, :eltype)```It looks like, the features have unequal amount of data entries for every column and they have many different types of variables. This can happen for the following reasons...- We may have missing values in our features.- We may have categorical features.- We may have alphanumerical or/and text features.## 2b. 
Dealing with Missing values<a id="dealwithnullvalues"></a>---**Missing values in _train_ dataset.**```{julia}#| _kg_hide-input: true#| execution: {iopub.execute_input: '2021-06-26T16:35:08.975451Z', iopub.status.busy: '2021-06-26T16:35:08.974927Z', iopub.status.idle: '2021-06-26T16:35:08.98326Z', shell.execute_reply: '2021-06-26T16:35:08.982644Z', shell.execute_reply.started: '2021-06-26T16:35:08.975205Z'}functionmissing_percentage(df::DF.DataFrame)"""This function takes a DataFrame as input and returns total missing values and percentages""" missing_counts = [count(ismissing, df[!, col]) for col in DF.names(df)] missing_pct =round.(missing_counts ./ DF.nrow(df) .*100, digits=2)# Create result DataFrame result = DF.DataFrame( Column = DF.names(df), Total = missing_counts, Percent = missing_pct )# Sort by total missing values (descending)return DF.sort(result, :Total, rev=true)end``````{julia}#| execution: {iopub.execute_input: '2021-06-26T16:35:09.092256Z', iopub.status.busy: '2021-06-26T16:35:09.09199Z', iopub.status.idle: '2021-06-26T16:35:09.108063Z', shell.execute_reply: '2021-06-26T16:35:09.107054Z', shell.execute_reply.started: '2021-06-26T16:35:09.092212Z'}missing_percentage(train)```**Missing values in _test_ set.**```{julia}#| _cell_guid: 073ef91b-e401-47a1-9b0a-d08ad710abce#| _kg_hide-input: true#| _uuid: 1ec1de271f57c9435ce111261ba08c5d6e34dbcb#| execution: {iopub.execute_input: '2021-06-26T16:35:09.208229Z', iopub.status.busy: '2021-06-26T16:35:09.207968Z', iopub.status.idle: '2021-06-26T16:35:09.221423Z', shell.execute_reply: '2021-06-26T16:35:09.220732Z', shell.execute_reply.started: '2021-06-26T16:35:09.208186Z'}missing_percentage(test)```We see that in both **train**, and **test** dataset have missing values. 
Let's make an effort to fill these missing values starting with "Embarked" feature.### Embarked feature---```{julia}#| _kg_hide-input: true#| execution: {iopub.execute_input: '2021-06-26T16:35:09.223175Z', iopub.status.busy: '2021-06-26T16:35:09.222681Z', iopub.status.idle: '2021-06-26T16:35:09.230671Z', shell.execute_reply: '2021-06-26T16:35:09.229793Z', shell.execute_reply.started: '2021-06-26T16:35:09.223128Z'}functionpercent_value_counts(df::DF.DataFrame, feature::Symbol)"""This function takes a dataframe and a column and finds the percentage of the value_counts"""# Count values including missing counts = DF.combine(DF.groupby(df, feature), DF.nrow =>:Total)# Calculate percentages counts.Percent =round.(counts.Total ./ DF.nrow(df) .*100, digits=2)# Sort by total count (descending)return DF.sort(counts, :Total, rev=true)end``````{julia}#| _kg_hide-input: true#| execution: {iopub.execute_input: '2021-06-26T16:35:09.236974Z', iopub.status.busy: '2021-06-26T16:35:09.236548Z', iopub.status.idle: '2021-06-26T16:35:09.254321Z', shell.execute_reply: '2021-06-26T16:35:09.253654Z', shell.execute_reply.started: '2021-06-26T16:35:09.236929Z'}percent_value_counts(train, :Embarked)```It looks like there are only two null values( ~ 0.22 %) in the Embarked feature, we can replace these with the mode value "S". However, let's dig a little deeper.**Let's see what are those two null values**```{julia}#| _cell_guid: 000ebdd7-ff57-48d9-91bf-a29ba79f1a1c#| _kg_hide-input: true#| _uuid: 6b9cb050e9dae424bb738ba9cdf3c84715887fa3#| execution: {iopub.execute_input: '2021-06-26T16:35:09.276102Z', iopub.status.busy: '2021-06-26T16:35:09.275649Z', iopub.status.idle: '2021-06-26T16:35:09.292037Z', shell.execute_reply: '2021-06-26T16:35:09.291163Z', shell.execute_reply.started: '2021-06-26T16:35:09.275879Z'}train[ismissing.(train.Embarked), :]```We may be able to solve these two missing values by looking at other independent variables of the two raws. 
Both passengers paid a fare of $80, are of Pclass 1 and female Sex. Let's see how the **Fare** is distributed among all **Pclass** and **Embarked** feature values```{julia}#| _cell_guid: bf257322-0c9c-4fc5-8790-87d8c94ad28a#| _kg_hide-input: true#| _uuid: ad15052fe6cebe37161c6e01e33a5c083dc2b558#| execution: {iopub.execute_input: '2021-06-26T16:35:09.293919Z', iopub.status.busy: '2021-06-26T16:35:09.293564Z', iopub.status.idle: '2021-06-26T16:35:09.866643Z', shell.execute_reply: '2021-06-26T16:35:09.865701Z', shell.execute_reply.started: '2021-06-26T16:35:09.293817Z'}fig = Makie.Figure()# Prepare data for plottingtrain_clean = DF.dropmissing(train, [:Embarked, :Fare, :Pclass])test_clean = DF.dropmissing(test, [:Embarked, :Fare, :Pclass])# Create mapping for embarked ports to numbersunique_categories =unique(train_clean.Embarked)category_to_index =Dict(category => i for (i, category) inenumerate(unique_categories))# Convert categorical to numerictrain_clean.Embarked_num = [category_to_index[port] for port in train_clean.Embarked]test_clean.Embarked_num = [category_to_index[port] for port in test_clean.Embarked]# Training set boxplotax1 = Makie.Axis(fig[1, 1], title ="Training Set", xlabel ="Embarked", ylabel ="Fare", xticks = (1:3, unique_categories))ax2 = Makie.Axis(fig[1, 2], title ="Test Set", xlabel ="Embarked", ylabel ="Fare", xticks = (1:3, unique_categories))Makie.boxplot!(ax2, test_clean.Embarked_num, test_clean.Fare, dodge = test_clean.Pclass, color = test_clean.Pclass)Makie.boxplot!(ax1, train_clean.Embarked_num, train_clean.Fare, dodge = train_clean.Pclass, color = train_clean.Pclass)fig```Here, in both training set and test set, the average fare closest to $80 are in the <b>C</b> Embarked values where pclass is 1. 
So, let's fill in the missing values as "C"```{julia}#| _cell_guid: 2f5f3c63-d22c-483c-a688-a5ec2a477330#| _kg_hide-input: true#| _uuid: 52e51ada5dfeb700bf775c66e9307d6d1e2233de#| execution: {iopub.execute_input: '2021-06-26T16:35:09.868523Z', iopub.status.busy: '2021-06-26T16:35:09.868016Z', iopub.status.idle: '2021-06-26T16:35:09.874135Z', shell.execute_reply: '2021-06-26T16:35:09.873022Z', shell.execute_reply.started: '2021-06-26T16:35:09.868249Z'}#| scrolled: true## Replacing the null values in the Embarked column with the mode.train.Embarked =coalesce.(train.Embarked, "C");```### Cabin Feature---```{julia}#| _cell_guid: e76cd770-b498-4444-b47a-4ac6ae63193b#| _kg_hide-input: true#| _uuid: b809a788784e2fb443457d7ef4ca17a896bf58b4#| execution: {iopub.execute_input: '2021-06-26T16:35:09.876171Z', iopub.status.busy: '2021-06-26T16:35:09.875621Z', iopub.status.idle: '2021-06-26T16:35:09.886193Z', shell.execute_reply: '2021-06-26T16:35:09.885088Z', shell.execute_reply.started: '2021-06-26T16:35:09.875859Z'}#| scrolled: trueprintln("Train Cabin missing: $(count(ismissing, train.Cabin) / DF.nrow(train))")println("Test Cabin missing: $(count(ismissing, test.Cabin) / DF.nrow(test))")```Approximately 77% of Cabin feature is missing in the training data and 78% missing on the test data.We have two choices,- we can either get rid of the whole feature, or- we can brainstorm a little and find an appropriate way to put them in use. For example, We may say passengers with cabin record had a higher socio-economic-status then others. 
We may also say passengers with cabin record were more likely to be taken into consideration when loading into the boat.Let's combine train and test data first and for now, will assign all the null values as **"N"**```{julia}#| _kg_hide-input: true#| _uuid: 8ff7b4f88285bc65d72063d7fdf8a09a5acb62d3#| execution: {iopub.execute_input: '2021-06-26T16:35:09.888377Z', iopub.status.busy: '2021-06-26T16:35:09.88784Z', iopub.status.idle: '2021-06-26T16:35:09.902296Z', shell.execute_reply: '2021-06-26T16:35:09.901697Z', shell.execute_reply.started: '2021-06-26T16:35:09.888114Z'}survivors = train.SurvivedDF.select!(train, DF.Not(:Survived)) # Remove Survived columnall_data =vcat(train, test)all_data.Cabin =coalesce.(all_data.Cabin, "N");```All the cabin names start with an English alphabet following by multiple digits. It seems like there are some passengers that had booked multiple cabin rooms in their name. This is because many of them travelled with family. However, they all seem to book under the same letter followed by different numbers. It seems like there is a significance with the letters rather than the numbers. 
Therefore, we can group these cabins according to the letter of the cabin name.```{julia}#| _cell_guid: 87995359-8a77-4e38-b8bb-e9b4bdeb17ed#| _kg_hide-input: true#| _uuid: c1e9e06eb7f2a6eeb1a6d69f000217e7de7d5f25#| execution: {iopub.execute_input: '2021-06-26T16:35:09.904181Z', iopub.status.busy: '2021-06-26T16:35:09.903766Z', iopub.status.idle: '2021-06-26T16:35:09.909654Z', shell.execute_reply: '2021-06-26T16:35:09.908573Z', shell.execute_reply.started: '2021-06-26T16:35:09.904014Z'}all_data.Cabin = [string(cabin[1]) for cabin in all_data.Cabin];```Now let's look at the value counts of the cabin features and see how it looks.```{julia}#| _kg_hide-input: true#| execution: {iopub.execute_input: '2021-06-26T16:35:09.91156Z', iopub.status.busy: '2021-06-26T16:35:09.911098Z', iopub.status.idle: '2021-06-26T16:35:09.928945Z', shell.execute_reply: '2021-06-26T16:35:09.928025Z', shell.execute_reply.started: '2021-06-26T16:35:09.911398Z'}percent_value_counts(all_data, :Cabin)```So, We still haven't done any effective work to replace the null values. Let's stop for a second here and think through how we can take advantage of some of the other features here.- We can use the average of the fare column We can use pythons **_groupby_** function to get the mean fare of each cabin letter.```{julia}#| _kg_hide-input: true#| execution: {iopub.execute_input: '2021-06-26T16:35:09.930774Z', iopub.status.busy: '2021-06-26T16:35:09.930283Z', iopub.status.idle: '2021-06-26T16:35:09.942122Z', shell.execute_reply: '2021-06-26T16:35:09.941067Z', shell.execute_reply.started: '2021-06-26T16:35:09.930532Z'}@chain all_data begin DF.dropmissing(:Fare) DF.groupby(:Cabin) DF.combine(:Fare => Stats.mean =>:Mean_Fare) DF.sort(:Mean_Fare)end```Now, these means can help us determine the unknown cabins, if we compare each unknown cabin rows with the given mean's above. 
Let's write a simple function so that we can give cabin names based on the means.```{julia}#| _kg_hide-input: true#| _uuid: a466da29f1989fa983147faf9e63d18783468567#| execution: {iopub.execute_input: '2021-06-26T16:35:09.943855Z', iopub.status.busy: '2021-06-26T16:35:09.943364Z', iopub.status.idle: '2021-06-26T16:35:09.952677Z', shell.execute_reply: '2021-06-26T16:35:09.952057Z', shell.execute_reply.started: '2021-06-26T16:35:09.943627Z'}functioncabin_estimator(fare::Union{Float64, Missing})"""Grouping cabin feature by the first letter based on fare"""# Handle missing valuesifismissing(fare)return"N"# Default cabin for missing fareendif fare <16return"G"elseif16≤ fare <27return"F"elseif27≤ fare <38return"T"elseif38≤ fare <47return"A"elseif47≤ fare <53return"E"elseif53≤ fare <54return"D"elseif54≤ fare <116return"C"elsereturn"B"endend```Let's apply <b>cabin_estimator</b> function in each unknown cabins(cabin with <b>null</b> values). Once that is done we will separate our train and test to continue towards machine learning modeling.```{julia}#| execution: {iopub.execute_input: '2021-06-26T16:35:09.95455Z', iopub.status.busy: '2021-06-26T16:35:09.954083Z', iopub.status.idle: '2021-06-26T16:35:09.96302Z', shell.execute_reply: '2021-06-26T16:35:09.962357Z', shell.execute_reply.started: '2021-06-26T16:35:09.95437Z'}with_N = all_data[all_data.Cabin .=="N", :]without_N = all_data[all_data.Cabin .!="N", :];``````{julia}#| _kg_hide-input: true#| _uuid: 1c646b64c6e062656e5f727d5499266f847c4832#| execution: {iopub.execute_input: '2021-06-26T16:35:09.965179Z', iopub.status.busy: '2021-06-26T16:35:09.96464Z', iopub.status.idle: '2021-06-26T16:35:09.981536Z', shell.execute_reply: '2021-06-26T16:35:09.980705Z', shell.execute_reply.started: '2021-06-26T16:35:09.964885Z'}with_N.Cabin =cabin_estimator.(with_N.Fare)# Combine back togetherall_data =vcat(with_N, without_N)# Sort by PassengerIdDF.sort!(all_data, :PassengerId)# Separate train and testtrain = all_data[1:891, :]test = 
all_data[892:end, :]# Add back survival informationtrain.Survived = survivors;```### Fare Feature---If you have paid attention so far, you know that there is only one missing value in the fare column. Let's have it.```{julia}print("test")test[ismissing.(test.Fare), :]```Here, We can take the average of the **Fare** column to fill in the NaN value. However, for the sake of learning and practicing, we will try something else. We can take the average of the values where**Pclass** is **_3_**, **Sex** is **_male_** and **Embarked** is **_S_**```{julia}#| _cell_guid: e742aa76-b6f8-4882-8bd6-aa10b96f06aa#| _kg_hide-input: true#| _uuid: f1dc8c6c33ba7df075ee608467be2a83dc1764fd#| execution: {iopub.execute_input: '2021-06-26T16:35:10.002749Z', iopub.status.busy: '2021-06-26T16:35:10.002232Z', iopub.status.idle: '2021-06-26T16:35:10.012662Z', shell.execute_reply: '2021-06-26T16:35:10.011431Z', shell.execute_reply.started: '2021-06-26T16:35:10.00248Z'}missing_value =@chain test begin DF.subset(:Pclass => x -> x .==3, :Embarked => x -> x .=="S", :Sex => x -> x .=="male") _.Fare skipmissing Stats.meanend# Replace missing faretest.Fare =coalesce.(test.Fare, missing_value);```### Age Feature---We know that the feature "Age" is the one with most missing values, let's see it in terms of percentage.```{julia}#| _cell_guid: 8ff25fb3-7a4a-4e06-b48f-a06b8d844917#| _kg_hide-input: true#| _uuid: c356e8e85f53a27e44b5f28936773a289592c5eb#| execution: {iopub.execute_input: '2021-06-26T16:35:10.014347Z', iopub.status.busy: '2021-06-26T16:35:10.014023Z', iopub.status.idle: '2021-06-26T16:35:10.024214Z', shell.execute_reply: '2021-06-26T16:35:10.023404Z', shell.execute_reply.started: '2021-06-26T16:35:10.014284Z'}println("Train age missing value: $(round(count(ismissing, train.Age) / DF.nrow(train) *100, digits=2))%")println("Test age missing value: $(round(count(ismissing, test.Age) / DF.nrow(test) *100, digits=2))%")```We will take a different approach since **~20% data in the Age column is 
missing** in both train and test dataset. The age variable seems to be promising for determining survival rate. Therefore, It would be unwise to replace the missing values with median, mean or mode. We will use machine learning model Random Forest Regressor to impute missing value instead of Null value. We will keep the age column unchanged for now and work on that in the feature engineering section.# Part 3. Visualization and Feature Relations<a id="visualization_and_feature_relations"></a>---Before we dive into finding relations between independent variables and our dependent variable(survivor), let us create some assumptions about how the relations may turn-out among features.**Assumptions:**- Gender: More female survived than male- Pclass: Higher socio-economic status passenger survived more than others.- Age: Younger passenger survived more than other passengers.- Fare: Passenger with higher fare survived more that other passengers. This can be quite correlated with Pclass.Now, let's see how the features are related to each other by creating some visualizations.## 3a. Gender and Survived<a id="gender_and_survived"></a>---```{julia}Makie.set_theme!(Makie.theme_light())``````{julia}fig1 = Makie.Figure()ax1 = Makie.Axis(fig1[1, 1], title ="Survived/Non-Survived Passenger Gender Distribution", xlabel ="Sex", ylabel ="% of passenger survived", xticks= (1:2, ["Male", "Female"]),)# Calculate survival rates by gendersurvival_by_sex =@chain train begin DF.groupby(:Sex) DF.combine(:Survived => Stats.mean =>:survival_rate) DF.sort(:Sex, rev=true) # Female firstend# Create elegant barplotMakie.barplot!(ax1, 1:2, survival_by_sex.survival_rate, color = ["green", "pink"], strokewidth =2, strokecolor =:black)fig1```This bar plot above shows the distribution of female and male survived. The **_x_label_** represents **Sex** feature while the **_y_label_** represents the % of **passenger survived**. 
This bar plot shows that ~74% female passenger survived while only ~19% male passenger survived.```{julia}fig = Makie.Figure()ax = Makie.Axis(fig[1, 1], title ="Passenger Gender Distribution - Survived vs Not-survived", xlabel ="Sex", ylabel ="# of Passenger Survived", xticks = (1:2, ["Male", "Female"]))# Count data for grouped bar chartcount_data =@chain train begin DF.groupby([:Sex, :Survived]) DF.combine(DF.nrow =>:count) DF.unstack(:Survived, :count, fill=0)end# Create grouped bar chartcounts = [count_data[1, 2], count_data[1, 3], count_data[2, 2], count_data[2, 3]]Makie.barplot!(ax, [1, 1, 2, 2], counts, dodge = [1, 2, 1,2], color = ["gray", "green", "gray", "green"])# Add legendMakie.Legend(fig[1, 2], [Makie.PolyElement(color ="gray"), Makie.PolyElement(color ="green")], ["Not Survived", "Survived"],"Survival Status")fig```This count plot shows the actual distribution of male and female passengers that survived and did not survive. It shows that among all the females ~ 230 survived and ~ 70 did not survive. While among male passengers ~110 survived and ~480 did not survive.**Summary**---- As we suspected, female passengers have survived at a much better rate than male passengers.- It seems about right since females and children were the priority.## 3b. 
Pclass and Survived<a id="pcalss_and_survived"></a>
---

```{julia}
fig3 = Makie.Figure()
ax3 = Makie.Axis(fig3[1, 1],
    title = "Passenger Class Distribution - Survival Percentage",
    xlabel = "Passenger Class",
    ylabel = "Percentage",
    titlesize = 20, xlabelsize = 16, ylabelsize = 16,
    xticks = (1:3, ["1st Class", "2nd Class", "3rd Class"]))

# Calculate percentages by class
class_survival = @chain train begin
    DF.groupby([:Pclass, :Survived])
    DF.combine(DF.nrow => :count)
    DF.unstack(:Survived, :count, fill = 0)
    DF.sort(:Pclass) # make sure the rows line up with the 1st/2nd/3rd class ticks
end
println(class_survival)

no_survived = class_survival[:, 2]  # Second column (not survived)
yes_survived = class_survival[:, 3] # Third column (survived)
total_by_class = no_survived + yes_survived
println(total_by_class)

survived_percentage = (yes_survived ./ total_by_class) * 100
not_survived_percentage = (no_survived ./ total_by_class) * 100
println(survived_percentage)

flatten = vcat(not_survived_percentage, survived_percentage)
Makie.barplot!(ax3, [1, 2, 3, 1, 2, 3], flatten,
    stack = [1, 2, 3, 1, 2, 3],
    color = ["#F44336", "#F44336", "#F44336", "#4CAF50", "#4CAF50", "#4CAF50"], # same colors as the legend
    strokewidth = 1, strokecolor = :black)

# Add legend
Makie.Legend(fig3[1, 2],
    [Makie.PolyElement(color = "#F44336"), Makie.PolyElement(color = "#4CAF50")],
    ["Not Survived", "Survived"],
    "Survival Status")
fig3
```

```{julia}
Makie.barplot([1, 2, 3], survived_percentage,
    axis = (xticks = (1:3, ["1st Class", "2nd Class", "3rd Class"]),
        title = "Passenger Class Distribution - Survived vs Non-Survived"),
)
```

- It looks like ... 
  - ~63% of first class passengers survived the Titanic tragedy, while
  - ~48% of second class and
  - only ~24% of third class passengers survived.

```{julia}
fig = Makie.Figure() # Adjust figure size as needed
# title and axis labels belong to the Axis, not the Figure
ax = Makie.Axis(fig[1, 1],
    title = "Passenger Class Distribution - Survived vs Non-Survived",
    xlabel = "Passenger Class",
    ylabel = "Density of Passenger Survived",
    xticks = ([1, 2, 3], ["Upper", "Middle", "Lower"]))

not_survived = train.Pclass[train.Survived .== 0]
survived = train.Pclass[train.Survived .== 1]
d1 = Makie.density!(ax, not_survived, color = (:gray, 0.2), strokecolor = :gray, strokewidth = 2)
d2 = Makie.density!(ax, survived, color = (:green, 0.2), strokecolor = :green, strokewidth = 2)
Makie.axislegend(ax, [d1, d2], ["Not Survived", "Survived"], "Survival Status")
fig
```

This KDE plot is pretty self-explanatory with all the labels and colors. One thing that some readers might find questionable is that, in absolute numbers, more lower-class passengers survived than second-class passengers. This is true because there were a lot more third-class passengers than first- and second-class passengers, even though their survival rate was the lowest.

**Summary**

---

The first class passengers had the upper hand during the tragedy. You will probably agree with me even more in the next section of visualizations, where we look at the distribution of ticket fare against the Survived column.

## 3c. 
Fare and Survived<a id="fare_and_survived"></a>
---

```{julia}
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "Fare Distribution - Survived vs Non-Survived",
    xlabel = "Fare",
    ylabel = "Density of Passenger Survived",
)
not_survived = train.Fare[train.Survived .== 0]
survived = train.Fare[train.Survived .== 1]
d1 = Makie.density!(ax, not_survived, color = (:gray, 0.2), strokecolor = :gray, strokewidth = 2)
d2 = Makie.density!(ax, survived, color = (:green, 0.2), strokecolor = :green, strokewidth = 2)
Makie.axislegend(ax, [d1, d2], ["Not Survived", "Survived"], "Survival Status")
fig
```

This plot shows something impressive:

- The spike in the plot under 100 dollars shows that a lot of passengers who bought a ticket within that range did not survive.
- When the fare is more than approximately 280 dollars, there is no gray shade, which means either everyone past that fare point survived or there is an outlier that clouds our judgment. Let's check...

```{julia}
train[train.Fare .> 280, :]
```

As we assumed, it looks like an outlier with a fare of \$512. We could certainly delete this point. However, we will keep it for now.

## 3d. Age and Survived<a id="age_and_survived"></a>
---

```{julia}
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "Age Distribution - Survived vs Non-Survived",
    xlabel = "Age",
    ylabel = "Density of Passenger Survived")

# drop rows with missing Age first
clean_train = DF.dropmissing(train, :Age)
not_survived = clean_train.Age[clean_train.Survived .== 0]
survived = clean_train.Age[clean_train.Survived .== 1]
d1 = Makie.density!(ax, not_survived, color = (:gray, 0.2), strokecolor = :gray, strokewidth = 2)
d2 = Makie.density!(ax, survived, color = (:green, 0.2), strokecolor = :green, strokewidth = 2)
Makie.axislegend(ax, [d1, d2], ["Not Survived", "Survived"], "Survival Status")
fig
```

There is nothing out of the ordinary about this plot, except the very left part of the distribution. This may hint at the possibility that children and infants were the priority.

## 3e. 
Combined Feature Relations<a id='combined_feature_relations'></a>
---

In this section, we are going to explore relations among more than two features in a single graph. I will try my best to illustrate most of the feature relations. Let's get to it.

```{julia}
fig8 = Makie.Figure()

# Create subplots for each combination
for (i, (sex, survived)) in enumerate(Iterators.product(["female", "male"], [0, 1]))
    ax = Makie.Axis(fig8[div(i - 1, 2) + 1, i % 2 + 1],
        title = "$sex, survived = $survived",
        xlabel = "Age",
        ylabel = "Count")
    subset_data = train[(train.Sex .== sex) .& (train.Survived .== survived) .& .!ismissing.(train.Age), :]
    if DF.nrow(subset_data) > 0
        Makie.hist!(ax, subset_data.Age, bins = 20,
            color = survived == 1 ? "green" : "gray",
            strokewidth = 1, strokecolor = :white)
    end
end

# A Figure has no title attribute, so add the title as a Label row
Makie.Label(fig8[0, :], "Survived by Sex and Age")
fig8
```

A facet grid is a great way to visualize multiple variables and their relationships at once. From the chart in section 3a we have an intuition that female passengers had better priority than males during the tragedy. From this facet grid, we can also see which age groups survived more than others and which were not so lucky.

```{julia}
fig8 = Makie.Figure()

# Create subplots for each combination
for (i, (sex, embarked)) in enumerate(Iterators.product(["female", "male"], ["S", "C", "Q"]))
    ax = Makie.Axis(fig8[div(i - 1, 2) + 1, i % 2 + 1],
        title = "$sex, $embarked",
    )
    subset_data = train[(train.Sex .== sex) .& (train.Embarked .== embarked) .& .!ismissing.(train.Age), :]
    for survived in [0, 1]
        subset_survived = subset_data[subset_data.Survived .== survived, :]
        println("Length of subset: $(DF.nrow(subset_survived))")
        # only draw a histogram when this survival group is not empty
        if DF.nrow(subset_survived) > 0
            Makie.hist!(ax, subset_survived.Age, bins = 20,
                color = survived == 1 ? (:green, 0.5) : (:gray, 0.5),
                strokewidth = 1, strokecolor = :white,
                label = survived == 1 ? "Survived" : "Not Survived")
        end
    end
end

Makie.Legend(fig8[1, 3],
    [Makie.PolyElement(color = (:gray, 0.7)), Makie.PolyElement(color = (:green, 0.7))],
    ["Not Survived", "Survived"],
    "Survival Status")
Makie.Label(fig8[0, :], "Survived by Sex, Embarked and Age")
fig8
```

This is another compelling facet grid illustrating the relationship among four features at once. They are **Embarked, Age, Survived & Sex**.

- The color illustrates passenger survival status (green represents survived, gray represents not survived).
- The column represents Sex (left is male, right is female).
- The row represents Embarked (from top to bottom: S, C, Q).

---

Now that I have pointed out the obvious, let's see if we can get some insights that are not so obvious as we look at the data.

- Most passengers seem to have boarded at Southampton (S).
- More than 60% of the passengers who boarded at Southampton died.
- More than 60% of the passengers who boarded at Cherbourg (C) lived.
- Pretty much every male who boarded at Queenstown (Q) did not survive.
- Very few females boarded at Queenstown; however, most of them survived.

```{julia}
fig9 = Makie.Figure(size = (1000, 600))

# Male subplot
ax9_m = Makie.Axis(fig9[1, 1], title = "Male", xlabel = "Fare", ylabel = "Age")
# Female subplot
ax9_f = Makie.Axis(fig9[1, 2], title = "Female", xlabel = "Fare", ylabel = "Age")

female_data = train[(train.Sex .== "female") .& .!ismissing.(train.Age), :]
male_data = train[(train.Sex .== "male") .& .!ismissing.(train.Age), :]

Makie.scatter!(ax9_m, male_data.Fare, male_data.Age,
    color = [s == 1 ? "green" : "gray" for s in male_data.Survived],
    strokewidth = 1, strokecolor = "white", markersize = 14)
Makie.scatter!(ax9_f, female_data.Fare, female_data.Age,
    color = [s == 1 ? "green" : "gray" for s in female_data.Survived],
    strokewidth = 1, strokecolor = "white", markersize = 14)

# Add legend
Makie.Legend(fig9[1, 3],
    [Makie.MarkerElement(color = "gray", marker = :circle), Makie.MarkerElement(color = "green", marker = :circle)],
    ["Not Survived", "Survived"],
    "Survived")
Makie.Label(fig9[0, :], "Survived by Sex, Fare and Age")
fig9
```

This facet grid unveils a couple of interesting insights. Let's find out.

- The grid above clearly shows the three outliers with a Fare of over \$500. At this point, I think we can be quite confident that these outliers should be deleted.
- Most of the passengers were within the Fare range of \$100.

```{julia}
fig10 = Makie.Figure(size = (800, 600))
ax10 = Makie.Axis(fig10[1, 1],
    title = "Parents/Children Survival Rate",
    xlabel = "Number of Parents/Children",
    ylabel = "Survival Rate",
)
parch_survival = @chain train_clean begin
    DF.groupby(:Parch)
    DF.combine(
        :Survived => Stats.mean => :survival_rate,
        :Survived => Stats.std => :std_dev,
        :Survived => length => :count,
    )
end
parch_survival.std_error = parch_survival.std_dev ./ sqrt.(parch_survival.count)
Makie.scatterlines!(ax10, parch_survival.Parch, parch_survival.survival_rate,
    color = "#2196F3", linewidth = 3, markersize = 8)
# named err_bars so that Base.error is not shadowed
err_bars = Makie.errorbars!(ax10, parch_survival.Parch, parch_survival.survival_rate, parch_survival.std_error,
    color = "blue", linewidth = 2, whiskerwidth = 8)
Makie.Legend(fig10[1, 2],
    [Makie.PolyElement(color = "#2196F3"), Makie.PolyElement(color = "blue")],
    ["Survival Rate", "Standard Error"],
    "Legend")
fig10
```

**Passengers who traveled in big groups with parents/children had a lower survival rate than other passengers.**

```{julia}
fig11 = Makie.Figure(size = (800, 600))
ax11 = Makie.Axis(fig11[1, 1],
    title = "Siblings/Spouses Survival Rate",
    xlabel = "Number of Siblings/Spouses",
    ylabel = "Survival 
Rate",
)
sibsp_survival = @chain train_clean begin
    DF.groupby(:SibSp)
    DF.combine(
        :Survived => Stats.mean => :survival_rate,
        :Survived => Stats.std => :std_dev,
        :Survived => length => :count,
    )
end
sibsp_survival.std_error = sibsp_survival.std_dev ./ sqrt.(sibsp_survival.count)
Makie.scatterlines!(ax11, sibsp_survival.SibSp, sibsp_survival.survival_rate,
    color = "#2196F3", linewidth = 3, markersize = 8)
# named err_bars so that Base.error is not shadowed
err_bars = Makie.errorbars!(ax11, sibsp_survival.SibSp, sibsp_survival.survival_rate, sibsp_survival.std_error,
    color = "blue", linewidth = 2, whiskerwidth = 8)
Makie.Legend(fig11[1, 2],
    [Makie.PolyElement(color = "#2196F3"), Makie.PolyElement(color = "blue")],
    ["Survival Rate", "Standard Error"],
    "Legend")
fig11
```

**Meanwhile, passengers who traveled in small groups with siblings/spouses had better chances of surviving than other passengers.**

```{julia}
# Encode Sex as a number: female = 0, male = 1
train.Sex = [sex == "female" ? 0 : 1 for sex in train.Sex]
test.Sex = [sex == "female" ? 0 : 1 for sex in test.Sex];
```

# Part 4: Statistical Overview<a id="statisticaloverview"></a>
---

**Train info**

```{julia}
DF.describe(train)
```

```{julia}
categorical_cols = [col for col in names(train) if eltype(train[!, col]) <: Union{String, AbstractString}]
DF.describe(train[!, categorical_cols])
```

```{julia}
survived_summary = @chain train begin
    DF.select(DF.names(train, Number)...)
    DF.groupby(:Survived)
    DF.combine(DF.All() .=> Stats.mean)
end
```

```{julia}
sex_summary = @chain train begin
    DF.select(DF.names(train, Number)...)
    DF.groupby(:Sex)
    DF.combine(DF.All() .=> Stats.mean)
end
```

```{julia}
class_summary = @chain train begin
    DF.select(DF.names(train, Number)...)
    DF.groupby(:Pclass)
    DF.combine(DF.All() .=> Stats.mean)
end
```

I have gathered a small summary from the statistical overview above. Let's see what it says...

- This train data set has 891 rows and 9 columns.
- Only 38% of passengers survived the tragedy.
- ~74% of female passengers survived, while only ~19% of male passengers survived.
- ~63% of first class passengers survived, while only 24% of lower class passengers survived.

## 4a. 
Correlation Matrix and Heatmap<a id="heatmap"></a>
---

### Correlations

```{julia}
numeric_cols = DF.select(train, DF.names(train, Number)...)
corr_matrix = Stats.cor(Matrix(numeric_cols[:, DF.Not(:Survived)]), numeric_cols.Survived)
survived_corr = DF.DataFrame(
    Variable = DF.names(numeric_cols[:, DF.Not(:Survived)]),
    Correlation = abs.(corr_matrix[:, 1]),
)
```

**Sex is the feature most strongly correlated with _Survived_ (the dependent variable), followed by Pclass.**

```{julia}
## get the most important variables.
survived_corr.Correlation_squared = survived_corr.Correlation .^ 2
DF.sort(survived_corr, :Correlation_squared, rev = true)
```

**Squaring the correlations not only makes every value positive but also accentuates the gap between strong and weak relationships.**

```{julia}
## Correlations among all numeric features
corr_all = Stats.cor(Matrix(numeric_cols))
feature_names = DF.names(numeric_cols)
n = length(feature_names)

## Mask the diagonal and upper triangle so each pair is shown only once
masked = [i > j ? corr_all[i, j] : NaN for i in 1:n, j in 1:n]

fig12 = Makie.Figure(size = (900, 800))
ax12 = Makie.Axis(fig12[1, 1],
    title = "Correlations Among Features",
    xticks = (1:n, feature_names), yticks = (1:n, feature_names),
    xticklabelrotation = pi / 4, yreversed = true)
## heatmap! expects values indexed as [x, y], so pass the transposed matrix
hm = Makie.heatmap!(ax12, 1:n, 1:n, permutedims(masked), colormap = :RdBu, colorrange = (-1, 1))
for i in 1:n, j in 1:n
    i > j && Makie.text!(ax12, j, i, text = string(round(corr_all[i, j], digits = 2)), align = (:center, :center))
end
Makie.Colorbar(fig12[1, 2], hm)
fig12
```

#### Positive Correlation Features:

- Fare and Survived: 0.26

#### Negative Correlation Features:

- Fare and Pclass: -0.6
- Sex and Survived: -0.54
- Pclass and Survived: -0.33

**So, let's analyze these correlations a bit.** We have found some moderately strong relationships between different features. There is a definite positive correlation between Fare and Survived. 
This relationship reveals that passengers who paid more for their ticket were more likely to survive. This theory aligns with another correlation, the one between Fare and Pclass (-0.6). That relationship can be explained by saying that first class passengers (1) paid more for their fare than second class passengers (2), and similarly second class passengers paid more than third class passengers (3). The theory is also supported by the correlation between Pclass and our dependent variable, Survived, which is -0.33. This can be explained by saying that first class passengers had a better chance of surviving than second class passengers, who in turn had a better chance than third class passengers.

However, the most significant correlation with our dependent variable is the Sex variable, which records whether the passenger was male or female. This negative correlation of -0.54 points toward some undeniable insights. Let's do some statistics to see how statistically significant this correlation is.

## 4b. Statistical Test for Correlation<a id="statistical_test"></a>
---

Statistical tests are the scientific way to validate theories. When we look at the data, we seem to have an intuitive understanding of where the data is leading us. However, when we do statistical tests, we get a scientific or mathematical perspective on how significant these results are. Let's apply some of these methods and see how we are doing with our predictions.

### Hypothesis Testing Outline

A hypothesis test compares the mean of a control group and an experimental group and tries to find out whether the two sample means are different from each other and, if they are different, how significant that difference is.

A **hypothesis test** usually consists of multiple parts:

1. Formulate a well-developed research problem or question: The hypothesis test usually starts with a concrete and well-developed research problem. 
We need to ask the right question that can be answered using statistical analysis.
2. **The null hypothesis ($H_0$) and alternative hypothesis ($H_A$)**:
   > - The **null hypothesis ($H_0$)** is something that is assumed to be true. It is the status quo. Under the null hypothesis, the observations are the result of pure chance. When we set out to experiment, we form the null hypothesis by saying that there is no difference between the means of the control group and the experimental group.
   > - An **alternative hypothesis ($H_A$)** is a claim and the opposite of the null hypothesis. It goes against the status quo. Under the alternative hypothesis, the observations show a real effect combined with a component of chance variation.
3. Determine the **test statistic**: The test statistic is used to assess the truth of the null hypothesis. Depending on whether the population standard deviation is known, we use either the z-statistic or the t-statistic. In addition to that, we want to identify whether the test is a one-tailed test or a two-tailed test. [This](https://support.minitab.com/en-us/minitab/18/help-and-how-to/statistics/basic-statistics/supporting-topics/basics/null-and-alternative-hypotheses/) article explains it pretty well. [This](https://stattrek.com/hypothesis-test/hypothesis-testing.aspx) article is pretty good as well.
4. Specify a **significance level** and **confidence level**: The significance level ($\alpha$) is the probability of rejecting a null hypothesis when it is true. In other words, we accept that, with probability $\alpha$, we will reject the null hypothesis even though it is true. The significance level is one minus the confidence level. For example, if our significance level is 5%, then our confidence level is (1 - 0.05) = 0.95 or 95%.
5. Compute the **T-Statistics/Z-Statistics**: Computing the t-statistics follows a simple equation. 
This equation differs slightly depending on whether we run a one-sample test or a two-sample test.
6. Compute the **P-value**: The p-value is the probability that a test statistic at least as extreme as the one observed would be obtained, assuming that the null hypothesis is correct. The p-value is known to be unintuitive, and even many professors are known to explain it wrong. I think this [video](https://www.youtube.com/watch?v=E4KCfcVwzyw) explains the p-value well. **The smaller the P-value, the stronger the evidence against the null hypothesis.**
7. **Describe the result and compare the p-value with the significance level ($\alpha$)**: If $p \le \alpha$, then the observed effect is statistically significant, and we reject the null hypothesis in favor of the alternative hypothesis. However, if $p > \alpha$, we say that we fail to reject the null hypothesis. Even though this sentence sounds awkward, it is logically right. We never accept the null hypothesis, because we are doing the statistical test with sample data points only.

We will follow each of the steps above to do our hypothesis testing below.

P.S. Khan Academy has a set of videos that I think are intuitive and that helped me understand the concepts.

---

### Hypothesis testing for Titanic

#### Formulating a well developed researched question:

Regarding this dataset, we can formulate the null hypothesis and alternative hypothesis by asking the following questions.

> - **Is there a significant difference in the mean sex between the passengers who survived and the passengers who did not survive?**
> - **Is there a substantial difference in the survival rate between the male and female passengers?**

#### The Null Hypothesis and The Alternative Hypothesis:

We can formulate our hypothesis by asking the questions differently. However, it is essential to understand what our end goal is. Here our dependent variable or target variable is **Survived**. 
Therefore, we say

> **Null Hypothesis ($H_0$):** There is no difference in the survival rate between the male and female passengers, or the mean difference between male and female passengers in the survival rate is zero.
>
> **Alternative Hypothesis ($H_A$):** There is a difference in the survival rate between the male and female passengers, or the mean difference in the survival rate between male and female passengers is not zero.

One thing we can do is set up the null and alternative hypotheses in such a way that, when we do our t-test, we can choose to do a one-tailed test. According to [this](https://support.minitab.com/en-us/minitab/18/help-and-how-to/statistics/basic-statistics/supporting-topics/basics/null-and-alternative-hypotheses/) article, one-tailed tests are more powerful than two-tailed tests. In addition to that, [this](https://www.youtube.com/watch?v=5NcMFlrnYp8&list=PLIeGtxpvyG-LrjxQ60pxZaimkaKKs0zGF) video is also quite helpful for understanding these topics. With this in mind we can modify our null and alternative hypotheses. Let's see how we can rewrite them.

> **Null Hypothesis (H0):** male mean is greater than or equal to female mean.
>
> **Alternative Hypothesis (H1):** male mean is less than female mean.

#### Determine the test statistics:

> With the rewritten hypotheses this is a one-tailed test, since we only ask whether the male survival rate is lower than the female survival rate. With the original pair of hypotheses it would be a two-tailed test, since the difference between male and female passengers in the survival rate could be higher or lower than 0.
>
> Since we do not know the standard deviation ($\sigma$) and n is small, we will use the t-distribution.

#### Specify the significance level:

> Specifying a significance level is an important step of the hypothesis test. It is a balance between type 1 error and type 2 error. We will discuss those in more depth in another lesson. For now, we have decided to make our significance level ($\alpha$) = 0.05. 
So, our confidence level, the non-rejection region, would be (1 - $\alpha$) = (1 - 0.05) = 95%.

#### Computing T-statistics and P-value:

Let's look at the two group means and see the difference.

```{julia}
male_mean = Stats.mean(train.Survived[train.Sex .== 1])
female_mean = Stats.mean(train.Survived[train.Sex .== 0])
println("Male survival mean: $male_mean")
println("Female survival mean: $female_mean")
println("The mean difference between male and female survival rate: $(female_mean - male_mean)")
```

Now, we have to understand that those two means are not **the population mean ($\mu$)**. _The population mean is a statistical term statisticians use to indicate the actual average of the entire group. The group can be any collection of things we can measure, such as animals, humans, plants, money, or stocks._ For example, to find the population mean age of Bulgaria, we would have to account for every single person's age and take the average. That is almost impossible, and if we were to go that route, there would be no point in doing statistics in the first place. Therefore we approach this problem using samples. The idea is that if we take multiple samples from the same population, take the mean of each, and put those means in a distribution, eventually the distribution starts to look like a **normal distribution**. The more samples we take, the more sample means are added, and the closer the center of that distribution gets to the population mean. This is where the **Central Limit Theorem** comes from. 
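The sampling idea described above can be tried out on made-up data. What follows is a minimal sketch under stated assumptions, not part of the original kernel: `population` is a hypothetical vector of 0/1 survival outcomes (not real Titanic rows), and `sample_means` is a helper name introduced here only for illustration. It uses only the standard library.

```julia
import Random
import Statistics as Stats

# Hypothetical population of 0/1 outcomes (assumption: 19% ones, not real data).
rng = Random.MersenneTwister(42)
population = vcat(ones(Int, 190), zeros(Int, 810))

# Draw `n_samples` samples of size `n` without replacement
# and return the mean of each sample.
function sample_means(rng, population, n, n_samples)
    return [
        Stats.mean(population[Random.randperm(rng, length(population))[1:n]])
        for _ in 1:n_samples
    ]
end

means_few = sample_means(rng, population, 50, 50)     # 50 samples of 50
means_many = sample_means(rng, population, 50, 5000)  # 5000 samples of 50

println("population mean:           ", Stats.mean(population))
println("mean of 50 sample means:   ", round(Stats.mean(means_few), digits = 3))
println("mean of 5000 sample means: ", round(Stats.mean(means_many), digits = 3))
println("spread of the sample means: ", round(Stats.std(means_many), digits = 3))
```

With more samples, the average of the sample means settles closer to the population mean, and a histogram of `means_many` would look roughly bell-shaped even though the underlying data contains only zeros and ones.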
We will go more in depth on this topic later on.

Going back to our dataset, the means above are only part of the whole story. We were given part of the data to train our machine learning models, and the other part of the data was held back for testing. Therefore, it is impossible for us at this point to know the population means of survival for males and females. A situation like this calls for a statistical approach. We will use the sampling distribution approach to do the test. Let's take 50 random samples, each of 50 males and 50 females, from our train data.

```{julia}
# separating male and female dataframe.
male = train[train.Sex .== 1, :]
female = train[train.Sex .== 0, :]

## draw 50 samples of 50 passengers (without replacement) and store each sample mean
m_mean_samples = [Stats.mean(StatsBase.sample(male.Survived, 50, replace = false)) for _ in 1:50]
f_mean_samples = [Stats.mean(StatsBase.sample(female.Survived, 50, replace = false)) for _ in 1:50]

# Print them out
println("Male mean sample mean: $(round(Stats.mean(m_mean_samples), digits = 2))")
println("Female mean sample mean: $(round(Stats.mean(f_mean_samples), digits = 2))")
println("Difference between male and female mean sample mean: $(round(Stats.mean(f_mean_samples) - Stats.mean(m_mean_samples), digits = 2))")
```

H0: male mean is greater than or equal to female mean<br>
H1: male mean is less than female mean.

According to the samples, the measured difference between our male sample mean ($\bar{x}_m$) and female sample mean ($\bar{x}_f$) is ~0.55 (statistically this is called the point estimate of the difference between the male and female population means). 
Keep in mind that...

- We randomly select 50 people to be in the male group and 50 people to be in the female group.
- We know our sample is selected from a broader population (the training set).
- We know we could have ended up with a totally different random sample of males and females.

---

With all three points above in mind, how confident are we that the measured difference is real or statistically significant? We can perform a **t-test** to evaluate that. When we perform a **t-test** we are usually trying to find **evidence of a significant difference between a population mean and a hypothesized mean (1 sample t-test) or, in our case, a difference between two population means (2 sample t-test).**

The **t-statistic** measures the degree to which our groups differ, standardized by the variance of our measurements. In other words, it is basically a measure of signal over noise. Let us describe the previous sentence a bit more for clarification. I am going to use [this post](http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-is-a-t-test-and-why-is-it-like-telling-a-kid-to-clean-up-that-mess-in-the-kitchen) as a reference to describe the t-statistic here.

#### Calculating the t-statistics

$$t = \frac{\bar{x}-\mu}{\frac{S}{\sqrt{n}}}$$

Here..

- $\bar{x}$ is the sample mean.
- $\mu$ is the hypothesized mean.
- $S$ is the standard deviation.
- $n$ is the sample size.

The numerator of this fraction, $(\bar{x}-\mu)$, is basically the strength of the signal: the difference between the sample mean and the hypothesized mean. If the mean difference is larger, the signal is stronger.

The denominator of this fraction, ${S}/{\sqrt{n}}$, measures the amount of variation or noise in the data set. Here S is the standard deviation, which tells us how much variation there is in the data. 
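The signal-over-noise idea can be checked numerically. The snippet below is a minimal sketch, not part of the original kernel: it computes the one-sample statistic from the formula above and the pooled two-sample statistic that this section applies to the male and female groups. The function names `t_one_sample` and `t_two_sample_pooled` and the two 0/1 vectors are hypothetical illustrations rather than real Titanic rows.

```julia
import Statistics as Stats

# One-sample t-statistic: (sample mean - hypothesized mean) / (S / sqrt(n)).
function t_one_sample(x, mu)
    n = length(x)
    return (Stats.mean(x) - mu) / (Stats.std(x) / sqrt(n))
end

# Two-sample t-statistic with a pooled variance s^2:
# (mean_M - mean_F) / sqrt(s^2 * (1/n_M + 1/n_F)).
function t_two_sample_pooled(x_m, x_f)
    n_m, n_f = length(x_m), length(x_f)
    pooled_var = ((n_m - 1) * Stats.var(x_m) + (n_f - 1) * Stats.var(x_f)) / (n_m + n_f - 2)
    return (Stats.mean(x_m) - Stats.mean(x_f)) / sqrt(pooled_var * (1 / n_m + 1 / n_f))
end

# Hypothetical samples of 50 survival outcomes each (assumption, not real data).
male_sample = vcat(ones(Int, 10), zeros(Int, 40))    # sample mean 0.20
female_sample = vcat(ones(Int, 37), zeros(Int, 13))  # sample mean 0.74

println("one-sample t (male mean vs. 0.5): ", t_one_sample(male_sample, 0.5))
println("two-sample pooled t (male - female): ", t_two_sample_pooled(male_sample, female_sample))
```

A large negative two-sample value means the male sample mean sits many standard errors below the female one. Turning either statistic into a p-value requires the t-distribution CDF, which is not in the standard library; packages such as Distributions.jl or HypothesisTests.jl provide it, but neither is loaded in this notebook, so that step is left out here.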
So, according to the explanation above, the t-value or t-statistic basically measures the strength of the signal (the difference) relative to the amount of noise (the variation) in the data, and that is how we calculate the t-value in a one-sample t-test. However, in order to compare two sample means, as in our case, we will use the following equation.

$$t = \frac{\bar{x}_M - \bar{x}_F}{\sqrt {s^2 (\frac{1}{n_M} + \frac{1}{n_F})}}$$

This equation may seem more complex; however, the idea behind the two is similar. Both of them follow the concept of signal/noise. The only difference is that we replace our hypothesized mean with another sample mean, and the two sample sizes replace the one sample size.

Here..

- $\bar{x}_M$ is the mean of our male group sample measurements.
- $\bar{x}_F$ is the mean of our female group sample measurements.
- $n_M$ and $n_F$ are the number of observations in each group.
- $s^2$ is the sample variance.

It is good to have an understanding of what is going on in the background. However, we will use **scipy.stats** to find the t-statistics.

#### Compare P-value with $\alpha$

> It looks like the p-value is very small compared to our significance level ($\alpha$) of 0.05. Our observed sample is statistically significant. Therefore, we reject our null hypothesis in favor of our alternative hypothesis, which is "**There is a significant difference in the survival rate between the male and female passengers."**

# Part 5: Feature Engineering<a id="feature_engineering"></a>
---

Feature engineering is exactly what it sounds like. Sometimes we want to create extra features from within the features that we have, and sometimes we want to remove features that are alike. Feature engineering is the simple word for doing all of those. 
It is important to remember that we should create new features in ways that do not cause **multicollinearity (a relationship among independent variables)** to occur.

## name_length

**_Creating a new feature "name_length" that will take the count of letters of each name_**

```{julia}
# Creating a new column with the length of each name
train.name_length = length.(train.Name)
test.name_length = length.(test.Name)

function name_length_group(size)
    if size <= 20
        return "short"
    elseif size <= 35
        return "medium"
    elseif size <= 45
        return "good"
    else
        return "long"
    end
end

train.nLength_group = map(name_length_group, train.name_length)
test.nLength_group = map(name_length_group, test.name_length)

## Here `map` is a built-in Julia function. It takes a function and a collection,
## applies the function to every element, and returns a new collection
## (here, a vector that becomes the new column).
## The broadcast form name_length_group.(train.name_length) would do the same.
## Alternative: cut name_length into 4 equal-width bins labelled short/medium/good/long.
```

## title

**Getting the title of each name as a new feature. 
**

```{julia}
## get the title from the name
train.title = [split(name, '.')[1] for name in train.Name]
train.title = [split(t, ',')[2] for t in train.title]
## Whenever we split like that, there is a good chance that we will end up with white space around our string values. Let's check that.
```

```{julia}
println(unique(train.title))
```

```{julia}
## Let's fix that
train.title = String.(strip.(train.title))
```

```{julia}
## We can also combine all three lines above for the test set here
test.title = [String(strip(split(split(name, '.')[1], ',')[2])) for name in test.Name]
## However it is important to be able to write readable code, and the line above is not so readable.
```

```{julia}
## Let's replace some of the rare values with the keyword 'rare' and other word choices of our own.
## train Data
train.title = replace.(train.title, "Ms" => "Miss")
train.title = replace.(train.title, "Mlle" => "Miss")
train.title = replace.(train.title, "Mme" => "Mrs")
train.title = replace.(train.title, "Dr" => "rare")
train.title = replace.(train.title, "Col" => "rare")
train.title = replace.(train.title, "Major" => "rare")
train.title = replace.(train.title, "Don" => "rare")
train.title = replace.(train.title, "Jonkheer" => "rare")
train.title = replace.(train.title, "Sir" => "rare")
train.title = replace.(train.title, "Lady" => "rare")
train.title = replace.(train.title, "Capt" => "rare")
train.title = replace.(train.title, "the Countess" => "rare")
train.title = replace.(train.title, "Rev" => "rare")
## Now in programming there is a term called DRY (Don't repeat yourself). Whenever we are repeating
## the same code over and over again, there should be a light bulb turning on in our head that makes us think
## to code in a way that is not repeating or dull. 
Let's write a function to do exactly what we## did in the code above, only not repeating and more interesting.``````{julia}#| execution: {iopub.execute_input: '2021-06-26T16:35:15.900031Z', iopub.status.busy: '2021-06-26T16:35:15.899771Z', iopub.status.idle: '2021-06-26T16:35:15.910036Z', shell.execute_reply: '2021-06-26T16:35:15.908929Z', shell.execute_reply.started: '2021-06-26T16:35:15.899989Z'}## we are writing a function that can help us modify title columndef name_converted(feature):""" This function helps modifying the title column """ result ='' if feature in ['the Countess','Capt','Lady','Sir','Jonkheer','Don','Major','Col', 'Rev', 'Dona', 'Dr']: result ='rare' elif feature in ['Ms', 'Mlle']: result ='Miss' elif feature =='Mme': result ='Mrs' else: result = feature return resulttest.title = test.title.map(name_converted)train.title = train.title.map(name_converted)``````{julia}#| execution: {iopub.execute_input: '2021-06-26T16:35:15.912187Z', iopub.status.busy: '2021-06-26T16:35:15.911644Z', iopub.status.idle: '2021-06-26T16:35:15.923512Z', shell.execute_reply: '2021-06-26T16:35:15.922507Z', shell.execute_reply.started: '2021-06-26T16:35:15.912136Z'}print(train.title.unique())print(test.title.unique())```## family_size**_Creating a new feature called "family_size"._**```{julia}#| _cell_guid: 7083a7e7-d1d5-4cc1-ad67-c454b139f5f1#| _kg_hide-input: true#| _uuid: cdfd54429cb235dd3b73535518950b2e515e54f2#| execution: {iopub.execute_input: '2021-06-26T16:35:15.925581Z', iopub.status.busy: '2021-06-26T16:35:15.925033Z', iopub.status.idle: '2021-06-26T16:35:15.933955Z', shell.execute_reply: '2021-06-26T16:35:15.933137Z', shell.execute_reply.started: '2021-06-26T16:35:15.925315Z'}## Family_size seems like a good feature to createtrain['family_size'] = train.SibSp + train.Parch+1test['family_size'] = test.SibSp + test.Parch+1``````{julia}#| _cell_guid: 3d471d07-7735-4aab-8b26-3f26e481dc49#| _kg_hide-input: true#| _uuid: 2e23467af7a2e85fcaa06b52b303daf2e5e44250#| 
execution: {iopub.execute_input: '2021-06-26T16:35:15.935971Z', iopub.status.busy: '2021-06-26T16:35:15.935422Z', iopub.status.idle: '2021-06-26T16:35:15.942647Z', shell.execute_reply: '2021-06-26T16:35:15.941882Z', shell.execute_reply.started: '2021-06-26T16:35:15.935671Z'}## bin the family size.def family_group(size):""" This funciton groups(loner, small, large) family based on family size """ a ='' if (size <=1): a ='loner' elif (size <=4): a ='small' else: a ='large' return a``````{julia}#| _cell_guid: 82f3cf5a-7e8d-42c3-a06b-56e17e890358#| _kg_hide-input: true#| _uuid: 549239812f919f5348da08db4264632d2b21b587#| execution: {iopub.execute_input: '2021-06-26T16:35:15.944511Z', iopub.status.busy: '2021-06-26T16:35:15.94417Z', iopub.status.idle: '2021-06-26T16:35:15.95416Z', shell.execute_reply: '2021-06-26T16:35:15.953395Z', shell.execute_reply.started: '2021-06-26T16:35:15.944448Z'}## apply the family_group function in family_sizetrain['family_group'] = train['family_size'].map(family_group)test['family_group'] = test['family_size'].map(family_group)```## is_alone```{julia}#| _cell_guid: 298b28d6-75a7-4e49-b1c3-7755f1727327#| _kg_hide-input: true#| _uuid: 45315bb62f69e94e66109e7da06c6c5ade578398#| execution: {iopub.execute_input: '2021-06-26T16:35:15.956031Z', iopub.status.busy: '2021-06-26T16:35:15.955569Z', iopub.status.idle: '2021-06-26T16:35:15.964779Z', shell.execute_reply: '2021-06-26T16:35:15.963853Z', shell.execute_reply.started: '2021-06-26T16:35:15.955855Z'}train['is_alone'] = [1 if i<2 else 0 for i in train.family_size]test['is_alone'] = [1 if i<2 else 0 for i in test.family_size]```## ticket```{julia}#| _cell_guid: 352c794d-728d-44de-9160-25da7abe0c06#| _kg_hide-input: true#| _uuid: 5b99e1f7d7757f11e6dd6dbc627f3bd6e2fbd874#| execution: {iopub.execute_input: '2021-06-26T16:35:15.966936Z', iopub.status.busy: '2021-06-26T16:35:15.9664Z', iopub.status.idle: '2021-06-26T16:35:15.97799Z', shell.execute_reply: '2021-06-26T16:35:15.976969Z', 
shell.execute_reply.started: '2021-06-26T16:35:15.966816Z'}train.Ticket.value_counts().sample(10)```I have yet to figureout how to best manage ticket feature. So, any suggestion would be truly appreciated. For now, I will get rid off the ticket feature.```{julia}#| _kg_hide-input: true#| _uuid: d23d451982f0cbe44976c2eacafb726d816e9195#| execution: {iopub.execute_input: '2021-06-26T16:35:15.979613Z', iopub.status.busy: '2021-06-26T16:35:15.979155Z', iopub.status.idle: '2021-06-26T16:35:15.989456Z', shell.execute_reply: '2021-06-26T16:35:15.988913Z', shell.execute_reply.started: '2021-06-26T16:35:15.97941Z'}train.drop(['Ticket'], axis=1, inplace=True)test.drop(['Ticket'], axis=1, inplace=True)```## calculated_fare```{julia}#| _cell_guid: adaa30fe-cb0f-4666-bf95-505f1dcce188#| _kg_hide-input: true#| _uuid: 9374a6357551a7551e71731d72f5ceb3144856df#| execution: {iopub.execute_input: '2021-06-26T16:35:15.991841Z', iopub.status.busy: '2021-06-26T16:35:15.991313Z', iopub.status.idle: '2021-06-26T16:35:15.999545Z', shell.execute_reply: '2021-06-26T16:35:15.998734Z', shell.execute_reply.started: '2021-06-26T16:35:15.991562Z'}## Calculating fare based on family size.train['calculated_fare'] = train.Fare/train.family_sizetest['calculated_fare'] = test.Fare/test.family_size```Some people have travelled in groups like family or friends. 
It seems like the Fare column records the total fare for the group rather than the fare of each individual passenger, so the calculated fare will be much handier in this situation.

## fare_group

```{julia}
def fare_group(fare):
    """
    This function creates a fare group based on the fare provided
    """
    a = ''
    if fare <= 4:
        a = 'Very_low'
    elif fare <= 10:
        a = 'low'
    elif fare <= 20:
        a = 'mid'
    elif fare <= 45:
        a = 'high'
    else:
        a = "very_high"
    return a

train['fare_group'] = train['calculated_fare'].map(fare_group)
test['fare_group'] = test['calculated_fare'].map(fare_group)

#train['fare_group'] = pd.cut(train['calculated_fare'], bins = 4, labels=groups)
```

Fare group was calculated based on <i>calculated_fare</i>. This can further help our cause.

## PassengerId

It seems like the <i>PassengerId</i> column only works as an id in this dataset, without any significant effect on the outcome. Let's drop it.

```{julia}
train.drop(['PassengerId'], axis=1, inplace=True)
test.drop(['PassengerId'], axis=1, inplace=True)
```

## Creating dummy variables

You might be wondering what a dummy variable is. Creating dummy variables is an important **preprocessing step in machine learning**. Categorical variables are often important features, which can be the difference between a good model and a great model.
While working with a dataset, having meaningful values such as "male" or "female" instead of 0's and 1's is more intuitive for us. However, machines do not understand categorical values. In this dataset, for example, gender is either male or female, and most algorithms do not accept such categorical variables as input. In order to feed data into a machine learning model, we need to convert each categorical variable into numeric 0/1 indicator columns, which are called dummy variables.

```{julia}
train = pd.get_dummies(train, columns=['title',"Pclass", 'Cabin','Embarked','nLength_group', 'family_group', 'fare_group'], drop_first=False)
test = pd.get_dummies(test, columns=['title',"Pclass",'Cabin','Embarked','nLength_group', 'family_group', 'fare_group'], drop_first=False)
train.drop(['family_size','Name', 'Fare','name_length'], axis=1, inplace=True)
test.drop(['Name','family_size',"Fare",'name_length'], axis=1, inplace=True)
```

## age

As I promised before, we are going to use a random forest regressor in this section to predict the missing age values.
Let's do it.

```{julia}
train.head()
```

```{julia}
## rearranging the columns so that I can easily use the dataframe to predict the missing age values.
train = pd.concat([train[["Survived", "Age", "Sex","SibSp","Parch"]], train.loc[:,"is_alone":]], axis=1)
test = pd.concat([test[["Age", "Sex"]], test.loc[:,"SibSp":]], axis=1)
```

```{julia}
## Importing RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

## writing a function that takes a dataframe with missing values and outputs it by filling the missing values.
def completing_age(df):
    ## getting all the features except survived
    age_df = df.loc[:,"Age":]

    temp_train = age_df.loc[age_df.Age.notnull()] ## df with age values
    temp_test = age_df.loc[age_df.Age.isnull()] ## df without age values

    y = temp_train.Age.values ## setting target variable (age) in y
    x = temp_train.loc[:, "Sex":].values

    rfr = RandomForestRegressor(n_estimators=1500, n_jobs=-1)
    rfr.fit(x, y)

    predicted_age = rfr.predict(temp_test.loc[:, "Sex":])

    df.loc[df.Age.isnull(), "Age"] = predicted_age

    return df

## Applying the completing_age function to both the train and test datasets.
completing_age(train)
completing_age(test);
```

Let's take a look at the histogram of the age column.

```{julia}
## Let's look at the histogram
plt.subplots(figsize = (22,10),)
sns.distplot(train.Age, bins = 100, kde = True, rug = False, norm_hist=False);
```

## age_group

We can create a new feature by grouping the "Age" column.

```{julia}
## create bins for age
def age_group_fun(age):
    """
    This function creates a bin for age
    """
    a = ''
    if age <= 1:
        a = 'infant'
    elif age <= 4:
        a = 'toddler'
    elif age <= 13:
        a = 'child'
    elif age <= 18:
        a = 'teenager'
    elif age <= 35:
        a = 'Young_Adult'
    elif age <= 45:
        a = 'adult'
    elif age <= 55:
        a = 'middle_aged'
    elif age <= 65:
        a = 'senior_citizen'
    else:
        a = 'old'
    return a

## Applying "age_group_fun" function to the "Age" column.
train['age_group'] = train['Age'].map(age_group_fun)
test['age_group'] = test['Age'].map(age_group_fun)

## Creating dummies for "age_group" feature.
train = pd.get_dummies(train,columns=['age_group'], drop_first=True)
test = pd.get_dummies(test,columns=['age_group'], drop_first=True);
```

<div class="alert alert-danger">
<h1>Need to paraphrase this section</h1>
<h2>Feature Selection</h2>
<h3>Feature selection is an important
part of machine learning models. There are many reasons why we use feature selection.</h3>
<ul>
    <li>Simple models are easier to interpret. People who act on model results have a better understanding of the model.</li>
    <li>Shorter training times.</li>
    <li>Enhanced generalisation by reducing overfitting.</li>
    <li>Easier for software developers to implement in model production.
        <ul>
            <li>As data scientists, we need to remember not to create models with too many variables, since that might overwhelm production engineers.</li>
        </ul>
    </li>
    <li>Reduced risk of data errors during model use.</li>
    <li>Less data redundancy.</li>
</ul>
</div>

# Part 6: Pre-Modeling Tasks

## 6a. Separating dependent and independent variables
<a id="dependent_independent"></a>

---

Before we apply any machine learning model, it is important to separate the dependent and independent variables. Our dependent variable, or target variable, is what we are trying to predict, and our independent variables are the features we use to predict it. We train a machine learning model by specifying the independent variables and the dependent variable. To specify them, we need to separate them from each other, and the code below does just that.

P.S. In our test dataset, we do not have a dependent variable feature. We are to predict that using machine learning models.

```{julia}
# separating our independent and dependent variables
X = train.drop(['Survived'], axis = 1)
y = train["Survived"]
```

## 6b. Splitting the training data
<a id="split_training_data" ></a>

---

There are multiple ways of splitting data. They are:

- train_test_split.
- cross_validation.

We have separated the dependent and independent features, and we have separate train and test data. So why do we still have to split our training data? If you are curious about that, I have the answer. For this competition, when we train the machine learning algorithms, we use only part of the training set, usually two-thirds of it. Once we have trained our algorithm on 2/3 of the train data, we test it on the remaining third. If the model performs well, we feed our test data to the algorithm to predict and submit to the competition. The code below splits the train data into 4 parts: **X_train**, **X_test**, **y_train**, **y_test**.

- **X_train** and **y_train** are first used to train the algorithm.
- Then, **X_test** is used in the trained algorithm to predict **outcomes**.
- Once we get the **outcomes**, we compare them with **y_test**.

By comparing the **outcome** of the model with **y_test**, we can determine whether our algorithm is performing well or not. As we compare, we use a confusion matrix to determine different aspects of model performance.

P.S. When we use cross validation it is important to remember not to use **X_train, X_test, y_train and y_test**; rather, we will use **X and y**.
I will discuss more on that.

```{julia}
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .33, random_state=0)
```

```{julia}
len(X_train)
```

```{julia}
len(X_test)
```

## 6c. Feature Scaling
<a id="feature_scaling" ></a>

---

Feature scaling is an important concept in machine learning. Often a dataset contains features that vary widely in magnitude and unit. For some machine learning models, this is not a problem. However, for many others, it is quite a problem. Many machine learning algorithms use Euclidean distance to measure how far apart two data points are, so a feature with a large range can dominate that distance.
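
To make the distance issue concrete, here is a minimal sketch in plain Python. The four (age, fare) rows are made-up numbers, not taken from the dataset, and the helper names `euclidean` and `standardize` are my own. The scaling uses the population standard deviation, which, as far as I know, is also what scikit-learn's `StandardScaler` uses.

```python
import math
import statistics

# Hypothetical passengers: (age in years, fare). Made-up values for illustration only.
rows = [(22.0, 7.25), (38.0, 71.28), (26.0, 7.92), (35.0, 53.10)]

def euclidean(p, q):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize(data):
    """Rescale every column to mean 0 and population standard deviation 1."""
    columns = list(zip(*data))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in data
    ]

scaled = standardize(rows)

# Before scaling, the distance is dominated by the fare difference.
print(euclidean(rows[0], rows[1]))
# After scaling, age and fare contribute on a comparable footing.
print(euclidean(scaled[0], scaled[1]))
```

In the raw data the fare gap swamps the age gap; after standardizing, neither feature dominates simply because of its unit.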
Let's again look at a sample of the **train** dataset below.

```{julia}
train.sample(5)
```

Here **Age** and **calculated_fare** are much higher in magnitude than the other features. This can create problems, as many machine learning models will treat **Age** and **calculated_fare** as carrying more weight than other features. Therefore, we need to do feature scaling to get a better result.

There are multiple ways to do feature scaling.
<ul>
    <li><b>MinMaxScaler</b>-Scales the data using the max and min values so that it fits between 0 and 1.</li>
    <li><b>StandardScaler</b>-Scales the data so that it has mean 0 and variance of 1.</li>
    <li><b>RobustScaler</b>-Scales the data similarly to StandardScaler, but uses the median and the interquartile range so as to avoid issues with large outliers.</li>
</ul>

I will discuss more on that in a different kernel. For now we will use <b>StandardScaler</b> to feature scale our dataset.

P.S.
I am showing a sample of both before and after so that you can see how scaling changes the dataset.

<h3><font color="#5831bc" face="Comic Sans MS">Before Scaling</font></h3>

```{julia}
headers = X_train.columns
X_train.head()
```

```{julia}
# Feature Scaling
## We will be using StandardScaler to transform the data
from sklearn.preprocessing import StandardScaler
st_scale = StandardScaler()

## fitting the scaler on "X_train" and transforming it
X_train = st_scale.fit_transform(X_train)
## transforming "X_test" with the statistics learned from "X_train"
X_test = st_scale.transform(X_test)

## transforming "The testset"
#test = st_scale.transform(test)
```

<h3><font color="#5831bc" face="Comic Sans MS">After Scaling</font></h3>

```{julia}
pd.DataFrame(X_train, columns=headers).head()
```

You can see how the features have transformed above.

# Part 7: Modeling the Data
<a id="modelingthedata"></a>

---

In the previous versions of this kernel, I thought about explaining each model before applying
it. However, this process makes this kernel too lengthy to read in one go. Therefore, I have decided to break this kernel down, explain each algorithm in a different kernel, and add the links here. If you would like to review logistic regression, please click [here](https://www.kaggle.com/masumrumi/logistic-regression-with-titanic-dataset).

```{julia}
# import the LogisticRegression model from scikit-learn.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, accuracy_score

## call on the model object
logreg = LogisticRegression(solver='liblinear', penalty='l1', random_state = 42)

## fit the model with "X_train" and "y_train"
logreg.fit(X_train,y_train)

## Once the model is trained we want to find out how well the model is performing, so we test the model.
## We use the "X_test" portion of the data (this data was not used to fit the model) to predict the model outcome.
y_pred = logreg.predict(X_test)

## Once predicted, we save that outcome in the "y_pred" variable.
## Then we compare the predicted value ("y_pred") and the actual value ("y_test") to see how well our model is performing.
```

<h1><font color="#5831bc" face="Comic Sans MS">Evaluating a classification model</font></h1>

There are multiple ways to evaluate a classification model.

- Confusion Matrix.
- ROC Curve.
- AUC (area under the ROC curve).

## Confusion Matrix

A <b>confusion matrix</b> is a table that <b>describes the performance of a classification model</b>. It tells us how many cases our model predicted correctly and incorrectly for each outcome class (binary or multiple) by comparing actual and predicted cases.
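
To make the counting concrete, here is a minimal sketch in plain Python of how such a table is tallied from actual and predicted classes. The labels are made up, and `confusion_counts` is a hypothetical helper written for illustration; the kernel itself relies on scikit-learn's `confusion_matrix`.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Tally a 2x2 table for binary labels 0 and 1.

    Rows are actual classes and columns are predicted classes:
    [[actual 0 & predicted 0, actual 0 & predicted 1],
     [actual 1 & predicted 0, actual 1 & predicted 1]]
    """
    pairs = Counter(zip(y_true, y_pred))
    return [
        [pairs[(0, 0)], pairs[(0, 1)]],
        [pairs[(1, 0)], pairs[(1, 1)]],
    ]

# Made-up labels for illustration: 1 = survived, 0 = not survived.
actual = [0, 0, 1, 1, 0, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 1]

table = confusion_counts(actual, predicted)
print(table)
```

The diagonal cells count the correct predictions, and the off-diagonal cells count the two kinds of mistakes.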
For example, in terms of this dataset, our model is a binary one, and we are trying to classify whether a passenger survived or did not survive. We have fit the model using **X_train** and **y_train** and stored the predictions for **X_test** in the variable **y_pred**. So now we will use a confusion matrix to compare **y_test** and **y_pred**. Let's build the confusion matrix.

```{julia}
from sklearn.metrics import classification_report, confusion_matrix
# printing confusion matrix
pd.DataFrame(confusion_matrix(y_test,y_pred),
             columns=["Predicted Not-Survived", "Predicted Survived"],
             index=["Not-Survived","Survived"] )
```

Our **y_test** has a total of 294 data points; it is the part of the original train set that we split off in order to evaluate our model. Each number here represents certain details about our model. If we think about this in terms of columns and rows, we can see that:

- the first column holds the data points that the model predicted as not-survived.
- the second column holds the data points that the model predicted as survived.
- in terms of rows, the first row, indexed as "Not-Survived", holds the passengers who actually did not survive.
- and the row indexed as "Survived" holds the passengers who actually survived.

Now you can see that predicted not-survived and predicted survived overlap with actual survived and actual not-survived. After all, it is a matrix, and we have some terminology to name these counts more specifically.
Let's see what they are:

<ul style="list-style-type:square;">
    <li><b>True Positive(TP)</b>: values that the model predicted as yes(survived) and are actually yes(survived).</li>
    <li><b>True Negative(TN)</b>: values that the model predicted as no(not-survived) and are actually no(not-survived).</li>
    <li><b>False Positive(or Type I error)</b>: values that the model predicted as yes(survived) but are actually no(not-survived).</li>
    <li><b>False Negative(or Type II error)</b>: values that the model predicted as no(not-survived) but are actually yes(survived).</li>
</ul>

For this dataset, whenever the model predicts yes, it means the model is predicting that the passenger survived; when the model predicts no, it means the passenger did not survive. Let's determine the value of each of these terms.

<ul style="list-style-type:square;">
    <li><b>True Positive(TP):87</b></li>
    <li><b>True Negative(TN):149</b></li>
    <li><b>False Positive(FP):28</b></li>
    <li><b>False Negative(FN):30</b></li>
</ul>

From these four counts, we can compute many other rates that are used to evaluate a binary classifier.

#### Accuracy:

**Accuracy is the measure of how often the model is correct.**

> (TP + TN)/Total = (87+149)/294 = 0.8027

We can also calculate the accuracy score using scikit-learn.

```{julia}
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
```

**Misclassification Rate:** Misclassification Rate is the measure of how often the model is wrong.

- Misclassification Rate and Accuracy are complements of each other.
- Misclassification Rate is equivalent to 1 minus Accuracy.
- Misclassification Rate is also known as "Error Rate".

> (FP + FN)/Total = (28+30)/294 = 0.1973

**True Positive
Rate/Recall/Sensitivity:** How often does the model predict yes(survived) when it's actually yes(survived)?

> TP/(TP+FN) = 87/(87+30) = 0.7435897435897436

```{julia}
from sklearn.metrics import recall_score
recall_score(y_test, y_pred)
```

**False Positive Rate:** How often does the model predict yes(survived) when it's actually no(not-survived)?

> FP/(FP+TN) = 28/(28+149) = 0.15819209039548024

**True Negative Rate/Specificity:** How often does the model predict no(not-survived) when it's actually no(not-survived)?

- True Negative Rate is equivalent to 1 minus False Positive Rate.

> TN/(TN+FP) = 149/(149+28) = 0.8418079096045198

**Precision:** How often is the model correct when it predicts yes?

> TP/(TP+FP) = 87/(87+28) = 0.7565217391304347

```{julia}
from sklearn.metrics import precision_score
precision_score(y_test, y_pred)
```

```{julia}
from sklearn.metrics import classification_report, balanced_accuracy_score
print(classification_report(y_test, y_pred))
```

We have our confusion matrix.
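
As a quick sanity check, the rates worked out by hand above can be recomputed from the four counts quoted in the text. This is only an arithmetic sketch in plain Python; the dictionary keys are names I chose for illustration.

```python
# Counts quoted in the text for the 294-passenger hold-out set.
TP, TN, FP, FN = 87, 149, 28, 30
total = TP + TN + FP + FN

metrics = {
    "accuracy": (TP + TN) / total,
    "misclassification_rate": (FP + FN) / total,
    "recall": TP / (TP + FN),               # true positive rate / sensitivity
    "false_positive_rate": FP / (FP + TN),
    "specificity": TN / (TN + FP),          # true negative rate
    "precision": TP / (TP + FP),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and misclassification rate add up to 1, as do specificity and the false positive rate, which is a handy way to catch a slip in the arithmetic.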
How about we give it a little more character?

```{julia}
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

np.set_printoptions(precision=2)
class_names = np.array(['not_survived','survived'])

# Plot non-normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()
```

<h1>AUC & ROC Curve</h1>

```{julia}
from sklearn.metrics import roc_curve, auc
#plt.style.use('seaborn-pastel')
y_score = logreg.decision_function(X_test)

FPR, TPR, _ = roc_curve(y_test, y_score)
ROC_AUC = auc(FPR, TPR)
print (ROC_AUC)

plt.figure(figsize =[11,9])
plt.plot(FPR, TPR, label='ROC curve(area = %0.2f)'%ROC_AUC, linewidth=4)
plt.plot([0,1],[0,1], 'k--', linewidth = 4)
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate', fontsize = 18)
plt.ylabel('True Positive Rate', fontsize = 18)
plt.title('ROC for Titanic survivors', fontsize= 18)
plt.legend(loc="lower right")  # show the label that carries the AUC value
plt.show()
```

```{julia}
from sklearn.metrics import precision_recall_curve

y_score = logreg.decision_function(X_test)

precision, recall, _ = precision_recall_curve(y_test, y_score)
PR_AUC = auc(recall, precision)

plt.figure(figsize=[11,9])
plt.plot(recall, precision, label='PR curve (area = %0.2f)' % PR_AUC, linewidth=4)
plt.xlabel('Recall', fontsize=18)
plt.ylabel('Precision', fontsize=18)
plt.title('Precision Recall Curve for Titanic survivors', fontsize=18)
plt.legend(loc="lower right")
plt.show()
```

## Using Cross-validation:

Pros:

- Helps reduce the variance of the performance estimate.
- Expands the model's predictability.

```{julia}
sc = st_scale
```

```{julia}
## Using StratifiedShuffleSplit
## We can use KFold, StratifiedShuffleSplit, StratifiedKFold or ShuffleSplit. They are all close cousins. Look at the sklearn user guide for more info.
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
cv = StratifiedShuffleSplit(n_splits = 10, test_size = .25, random_state = 0 ) # run the model 10 times, each on a random stratified 75/25 train/test split
## Using standard scale for the whole dataset.

## saving the feature names for decision tree display
column_names = X.columns

X = sc.fit_transform(X)
accuracies = cross_val_score(LogisticRegression(solver='liblinear'), X,y, cv = cv)
print ("Cross-Validation accuracy scores:{}".format(accuracies))
print ("Mean Cross-Validation accuracy score: {}".format(round(accuracies.mean(),5)))
```

## Grid Search on Logistic Regression

- What is grid search?
- What are the pros and cons?

**Gridsearch** is a simple concept but an effective technique in machine learning.
The name **GridSearch** comes from the fact that we search for the optimal parameter(s) over a "grid" of candidate values. The parameters being tuned are known as **hyperparameters**. **Hyperparameters are model parameters that are set before fitting the model and determine the behavior of the model.** For example, when we choose to use linear regression, we may decide to add a penalty to the loss function such as Ridge or Lasso. These penalties require a specific alpha (the strength of the regularization technique) to be set beforehand. The higher the value of alpha, the more penalty is added. GridSearch finds the optimal value of alpha among a range of values provided by us, and then we go on and use that optimal value to fit the model and get sweet results. It is essential to understand that hyperparameters are different from model outcomes: **coefficients** and model evaluation metrics such as **accuracy score** or **mean squared error** are outcomes of fitting, not hyperparameters.

#### This part of the kernel is a work in progress. Please check back again for future updates.####

```{julia}
from sklearn.model_selection import GridSearchCV, StratifiedKFold
## C_vals holds candidate values of C, the inverse of the regularization strength
## used by lasso and ridge (as C decreases, the penalty grows and the model complexity decreases).
## remember valid values are 0 < C < infinity
C_vals = [0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,16.5,17,17.5,18]
## Choosing penalties(Lasso(l1) or Ridge(l2))
penalties = ['l1','l2']
## Choose a cross validation strategy.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.25)

## setting param for param_grid in GridSearchCV.
param = {'penalty': penalties, 'C': C_vals}

## the liblinear solver supports both the l1 and the l2 penalty.
logreg = LogisticRegression(solver='liblinear')
## Calling on GridSearchCV object.
grid = GridSearchCV(estimator=logreg,
                    param_grid=param,
                    scoring='accuracy',
                    n_jobs=-1,
                    cv=cv)
## Fitting the model
grid.fit(X, y)
```

```{julia}
## Getting the best of everything.
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
```

#### Using the best parameters from the grid-search.

```{julia}
### Using the best parameters from the grid-search.
logreg_grid = grid.best_estimator_
logreg_grid.score(X, y)
```

#### This part of the kernel is a work in progress. Please check back again for future updates.####

Resources:

- [Confusion Matrix](https://www.youtube.com/watch?v=8Oog7TXHvFY)

### Under-fitting & Over-fitting:

So, we have our first model and its score. But how do we make sure that our model is performing well? Our model may be overfitting or underfitting. In fact, for those of you who don't know what overfitting and underfitting are, let's find out.

As you see in the chart above, **underfitting** is when the model fails to capture important aspects of the data and therefore introduces more bias and performs poorly. On the other hand, **overfitting** is when the model performs too well on the training data but does poorly on the validation or test sets. This situation is also known as having low bias but high variance, and the model performs poorly as well. Ideally, we want to configure a model that performs well not only on the training data but also on the test data. This is where the **bias-variance tradeoff** comes in. When we have a model that overfits, meaning less bias and more variance, we introduce some bias in exchange for having much less variance. One particular tactic for this task is regularization models (Ridge, Lasso, Elastic Net). These models are built to deal with the bias-variance tradeoff. This [kernel](https://www.kaggle.com/dansbecker/underfitting-and-overfitting) explains this topic well. Also, the following chart gives us a mental picture of where we want our models to be.

Ideally, we want to pick a sweet spot where the model performs well on the training set, validation set, and test set. As the model gets more complex, bias decreases and variance increases. However, the most critical part is the error rates.
We want our models to be at the bottom of that **U** shape where the error rate is the lowest. That sweet spot is also known as **Optimum Model Complexity (OMC).**

Now that we know what we want in terms of under-fitting and over-fitting, let's talk about how to combat them.

How to combat over-fitting?
<ul>
    <li>Simplify the model by using fewer parameters.</li>
    <li>Simplify the model by changing the hyperparameters.</li>
    <li>Introduce regularization models.</li>
    <li>Use more training data.</li>
    <li>Gather more data (and gather better quality data).</li>
</ul>

#### This part of the kernel is a work in progress. Please check back again for future updates.####

## 7b. K-Nearest Neighbor classifier(KNN)
<a id="knn"></a>

---

```{julia}
## Importing the model.
from sklearn.neighbors import KNeighborsClassifier
## calling on the model object.
knn = KNeighborsClassifier(metric='minkowski', p=2)
## with metric='minkowski' and p=2 the KNN classifier uses the Euclidean distance

## doing 10 fold stratified-shuffle-split cross validation
cv = StratifiedShuffleSplit(n_splits=10, test_size=.25, random_state=2)

accuracies = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')
print("Cross-Validation accuracy scores:{}".format(accuracies))
print("Mean Cross-Validation accuracy score: {}".format(round(accuracies.mean(), 3)))
```

#### Manually find the best possible k value for KNN

```{julia}
## Search for an optimal value of k for KNN.
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')
    k_scores.append(scores.mean())
print("Accuracy scores are: {}\n".format(k_scores))
print("Mean accuracy score: {}".format(np.mean(k_scores)))
```

```{julia}
from matplotlib import pyplot as plt
plt.plot(k_range, k_scores)
```

### Grid search on KNN classifier

```{julia}
from sklearn.model_selection import GridSearchCV
## trying out multiple values for k
k_range = range(1, 31)
## trying out both weighting schemes
weights_options = ['uniform', 'distance']
param = {'n_neighbors': k_range, 'weights': weights_options}
## Using StratifiedShuffleSplit.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
# estimator = knn, param_grid = param, n_jobs = -1 to instruct scikit learn to use all available processors.
grid = GridSearchCV(KNeighborsClassifier(), param, cv=cv, verbose=False, n_jobs=-1)
## Fitting the model.
grid.fit(X, y)
```

```{julia}
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
```

#### Using best estimator from grid search using KNN.

```{julia}
### Using the best parameters from the grid-search.
knn_grid = grid.best_estimator_
knn_grid.score(X, y)
```

#### Using RandomizedSearchCV

Randomized search is a close cousin of grid search. It doesn't always provide the best result, but it's fast because it only tries a fixed number of sampled parameter combinations.

```{julia}
from sklearn.model_selection import RandomizedSearchCV
## trying out multiple values for k
k_range = range(1, 31)
## trying out both weighting schemes
weights_options = ['uniform', 'distance']
param = {'n_neighbors': k_range, 'weights': weights_options}
## Using StratifiedShuffleSplit.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30)
# estimator = knn, param_grid = param, n_jobs = -1 to instruct scikit learn to use all available processors.
## for RandomizedSearchCV, n_iter sets how many parameter combinations are sampled.
grid = RandomizedSearchCV(KNeighborsClassifier(), param, cv=cv, verbose=False, n_jobs=-1, n_iter=40)
## Fitting the model.
grid.fit(X, y)
```

```{julia}
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
```

```{julia}
### Using the best parameters from the randomized search.
knn_ran_grid = grid.best_estimator_
knn_ran_grid.score(X, y)
```

## Gaussian Naive Bayes
<a id="gaussian_naive"></a>

---

```{julia}
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

gaussian = GaussianNB()
gaussian.fit(X, y)
y_pred = gaussian.predict(X_test)
gaussian_accy = round(accuracy_score(y_pred, y_test), 3)
print(gaussian_accy)
```

## Support Vector Machines(SVM)
<a id="svm"></a>

---

```{julia}
from sklearn.svm import SVC
Cs = [0.001, 0.01, 0.1, 1, 1.5, 2, 2.5, 3, 4, 5, 10]  ## penalty parameter C for the error term.
gammas = [0.0001, 0.001, 0.01, 0.1, 1]
param_grid = {'C': Cs, 'gamma': gammas}
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
grid_search = GridSearchCV(SVC(kernel='rbf', probability=True), param_grid, cv=cv)  ## 'rbf' stands for gaussian kernel
grid_search.fit(X, y)
```

```{julia}
print(grid_search.best_score_)
print(grid_search.best_params_)
print(grid_search.best_estimator_)
```

```{julia}
# using the best found hyper parameters to get the score.
svm_grid = grid_search.best_estimator_
svm_grid.score(X, y)
```

## Decision Tree Classifier

A decision tree works by breaking down the dataset into small subsets. This breaking down process is done by asking questions about the features of the dataset. The idea is to unmix the labels by asking as few questions as necessary. As we ask questions, we break the dataset down into more subsets. Once a subgroup contains only one type of label, we end the tree at that node.
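
To make the idea of "unmixing" labels concrete, here is a minimal sketch in plain Python of the Gini impurity, one of the two criteria (`"gini"` and `"entropy"`) tried in this kernel's decision tree grid. The helper names, the toy labels, and the example question are illustrative assumptions, not taken from the Titanic data or from scikit-learn internals.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 binary mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two groups a question creates."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical labels: 1 = survived, 0 = not survived.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
# A hypothetical yes/no question splits the parent into two subgroups.
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

print(gini(parent))                 # 0.46875 (mixed labels before asking)
print(split_impurity(left, right))  # 0.1875  (less mixed after asking)
print(gini(right))                  # 0.0     (pure group: the tree stops here)
```

A question is a good one when it lowers the weighted impurity; a subgroup with impurity 0 holds a single label, which is where that branch ends.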
If you would like to get a detailed understanding of the decision tree classifier, please take a look at [this](https://www.kaggle.com/masumrumi/decision-tree-with-titanic-dataset) kernel.

```{julia}
from sklearn.tree import DecisionTreeClassifier
max_depth = range(1, 30)
## 'sqrt' replaces the older 'auto' option, which meant the same thing for classifiers
## and is no longer accepted by recent scikit-learn releases.
max_feature = [21, 22, 23, 24, 25, 26, 28, 29, 30, 'sqrt']
criterion = ["entropy", "gini"]

param = {'max_depth': max_depth,
         'max_features': max_feature,
         'criterion': criterion}
grid = GridSearchCV(DecisionTreeClassifier(),
                    param_grid=param,
                    verbose=False,
                    cv=StratifiedKFold(n_splits=20, random_state=15, shuffle=True),
                    n_jobs=-1)
grid.fit(X, y)
```

```{julia}
#| scrolled: true
print(grid.best_params_)
print(grid.best_score_)
print(grid.best_estimator_)
```

```{julia}
dectree_grid = grid.best_estimator_
## using the best found hyper parameters to get the score.
dectree_grid.score(X, y)
```

<h4> Let's look at the feature importance from decision tree grid.</h4>

```{julia}
## feature importance
feature_importances = pd.DataFrame(dectree_grid.feature_importances_,
                                   index=column_names,
                                   columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head(10)
```

These are the top 10 features that, according to the **Decision Tree**, helped classify the fates of the passengers on the Titanic that night.

## 7f. Random Forest Classifier
<a id="random_forest"></a>

I admire working with decision trees because of the potential and basics they provide towards building a more complex model like Random Forest (RF). RF is an ensemble method (a combination of many decision trees), which is where the "forest" part comes in. One crucial detail about Random Forest is that while using a forest of decision trees, the RF model <b>takes random subsets of the original dataset (bootstrapped)</b> and <b>random subsets of the variables (features/columns)</b>. Using this method, the RF model creates hundreds to thousands (the number can be set manually) of widely varying decision trees. This variety makes the RF model more effective and accurate.
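
The two sources of randomness described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names and the dataset sizes are hypothetical, and picking roughly the square root of the number of features mirrors the `'sqrt'` setting rather than reproducing scikit-learn's implementation (where the feature subset is redrawn at every split, not once per tree).

```python
import math
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (a bootstrapped dataset)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct columns, as tried at one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)        # fixed seed so the draws are repeatable
n_rows, n_features = 10, 9    # hypothetical sizes, not the Titanic data

for tree in range(3):
    rows = bootstrap_indices(n_rows, rng)
    cols = feature_subset(n_features, rng)
    left_out = sorted(set(range(n_rows)) - set(rows))
    print(f"tree {tree}: rows={rows} features={cols} out-of-bag={left_out}")
```

Because rows are drawn with replacement, some rows repeat and others are left out of a given tree, so every tree sees a slightly different dataset.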
We then run each test data point through all of these hundreds to thousands of decision trees in the RF model and take a vote on the output.

```{julia}
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
n_estimators = [140, 145, 150, 155, 160]
max_depth = range(1, 10)
criterions = ['gini', 'entropy']
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)

parameters = {'n_estimators': n_estimators,
              'max_depth': max_depth,
              'criterion': criterions}
## 'sqrt' replaces the older 'auto' option, which meant the same thing for classifiers.
grid = GridSearchCV(estimator=RandomForestClassifier(max_features='sqrt'),
                    param_grid=parameters,
                    cv=cv,
                    n_jobs=-1)
grid.fit(X, y)
```

```{julia}
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
```

```{julia}
rf_grid = grid.best_estimator_
rf_grid.score(X, y)
```

```{julia}
#| _kg_hide-input: true
from sklearn.metrics import classification_report
# Predict with the tuned random forest so the report describes this model
# (y_pred previously held the Gaussian Naive Bayes predictions).
y_pred = rf_grid.predict(X_test)
# Print classification report for y_test
print(classification_report(y_test, y_pred, labels=rf_grid.classes_))
```

## Feature Importance

```{julia}
#| _kg_hide-input: true
## feature importance
feature_importances = pd.DataFrame(rf_grid.feature_importances_,
                                   index=column_names,
                                   columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head(10)
```

<h3>Why Random Forest?(Pros and Cons)</h3>

---

<h2>Introducing Ensemble Learning</h2>

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

There are two types of ensemble learning.

**Bagging/Averaging Methods**

> In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.

**Boosting Methods**

> The other family of ensemble methods are boosting methods, where base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

<h4 align="right">Source:GA</h4>

Resource: <a href="https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205">Ensemble methods: bagging, boosting and stacking</a>

---

## 7g. Bagging Classifier
<a id="bagging"></a>

---

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html">Bagging Classifier</a> (Bootstrap Aggregating) is the ensemble method that involves manipulating the training set by resampling and running algorithms on it. Let's do a quick review:

- The bagging classifier uses a process called bootstrapping to create multiple datasets from one original dataset and runs an algorithm on each one of them. Here is an image to show how a bootstrapped dataset works.

<img src="https://uc-r.github.io/public/images/analytics/bootstrap/bootstrap.png" width="600">
<h4 align="center">Resampling from original dataset to bootstrapped datasets</h4>
<h4 align="right">Source: https://uc-r.github.io</h4>

- After running a learning algorithm on each one of the bootstrapped datasets, all models are combined by taking their average. The test data/new data then go through this averaged/combined classifier to predict the output.

Here is an image to make it clear how bagging works:

<img src="https://prachimjoshi.files.wordpress.com/2015/07/screen_shot_2010-12-03_at_5-46-21_pm.png" width="600">
<h4 align="right">Source: https://prachimjoshi.files.wordpress.com</h4>

Please check out [this](https://www.kaggle.com/masumrumi/bagging-with-titanic-dataset) kernel if you want to find out more about the bagging classifier.

```{julia}
from sklearn.ensemble import BaggingClassifier
n_estimators = [10, 30, 50, 70, 80, 150, 160, 170, 175, 180, 185]
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)

parameters = {'n_estimators': n_estimators}
## The base estimator is left at its default, which is a decision tree.
grid = GridSearchCV(BaggingClassifier(bootstrap_features=False),
                    param_grid=parameters,
                    cv=cv,
                    n_jobs=-1)
grid.fit(X, y)
```

```{julia}
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
```

```{julia}
bagging_grid = grid.best_estimator_
bagging_grid.score(X, y)
```

<h3>Why use Bagging? (Pros and cons)</h3>

Bagging works best with strong and complex models (for example, fully developed decision trees). However, don't let that fool you into thinking that, like a decision tree, bagging also overfits the model. Instead, bagging reduces overfitting since a lot of the sample training data are repeated and used to create base estimators. With a lot of equally likely training data, bagging is not very susceptible to overfitting with noisy data and therefore reduces variance. However, the downside is that this leads to an increase in bias.

<h4>Random Forest VS. Bagging Classifier</h4>

If some of you are like me, you may find Random Forest to be similar to Bagging Classifier. However, there is a fundamental difference between these two, which is **Random Forest's ability to pick a subset of features at each node.** I will elaborate on this in a future update.

## 7h. AdaBoost Classifier
<a id="AdaBoost"></a>

---

AdaBoost is another <b>ensemble model</b> and is quite different from Bagging.
Let's point out the core concepts.

> AdaBoost combines a lot of "weak learners" (also called stumps: trees with only one node and two leaves) to make classifications.

> This base model fitting is an iterative process where each stump is chained one after the other; <b>it cannot run in parallel.</b>

> <b>Some stumps get more say in the final classification than others.</b> The model uses weights that are assigned to each data point/row indicating its "importance." Samples with higher weight have a higher influence on the total error of the next model and get more priority. The first stump starts with uniformly distributed weights, which means that, in the beginning, every data point has an equal weight.

> <b>Each stump is made by taking the previous stump's mistakes into account.</b> After each iteration the weights get re-calculated in order to take the errors/misclassifications from the last stump into consideration.

> The final prediction is typically constructed by a weighted vote where the weight for each base model depends on its training error or misclassification rate.

To illustrate what we have talked about so far, let's look at the following visualization.

<img src="https://cdn-images-1.medium.com/max/1600/0*paPv7vXuq4eBHZY7.png">
<h5 align="right"> Source: Diogo(Medium)</h5>

Let's dive into the nitty-gritty details of AdaBoost:

---

> <b>First</b>, we determine the best feature to split the dataset using the Gini index (basics from decision tree). The feature with the lowest Gini index becomes the first stump in the AdaBoost stump chain (the lower the Gini index is, the better unmixed the labels are and, therefore, the better the split).

---

> <b>Secondly</b>, we need to determine how much say a stump will have in the final classification and how we can calculate that.

- We learn how much say a stump has in the final classification by calculating how well it classified the samples (aka calculating the total weighted error).
- The <b>Total Error</b> for a stump is the sum of the weights associated with the incorrectly classified samples. For example, let's say we start a stump with 10 data points. The first stump will uniformly distribute the weight among all the data points, which means each data point will have a weight of 1/10. Let's say once the weight is distributed we run the model and find 2 incorrect predictions. In order to calculate the total error we add up the weights of all the misclassified samples. Here we get 1/10 + 1/10 = 2/10 or 1/5. This is our total error. With uniform weights, we can also think about it as the misclassification rate:

$$ \epsilon_t = \frac{\text{misclassifications}_t}{\text{observations}_t} $$

- Since the weights of all data points add up to 1, the total error will always be between 0 (perfect stump) and 1 (horrible stump).
- We use the total error to determine the amount of say a stump has in the final classification using the following formula

$$ \alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right), \quad \text{where } 0 < \epsilon_t < 1 $$

where $\epsilon_t$ is the total error (the misclassification rate defined above) of the current classifier.

Here...

- $\alpha_t$ = Amount of Say
- $\epsilon_t$ = Total error

We can draw a graph to determine the amount of say using the value of the total error (0 to 1).

<img src="http://chrisjmccormick.files.wordpress.com/2013/12/adaboost_alphacurve.png">
<h5 align="right"> Source: Chris McCormick</h5>

- The blue line tells us the amount of say for a <b>Total Error (Error rate)</b> between 0 and 1.
- When the stump does a reasonably good job and the <b>total error</b> is minimal, the <b>amount of say (Alpha)</b> is relatively large and the alpha value is positive.
- When the stump does an average job (similar to a coin flip, with the ratio of correct to incorrect predictions around 50%/50%), the <b>total error</b> is ~0.5. In this case the <b>amount of say</b> is <b>0</b>.
- When the error rate is high, let's say close to 1, the <b>amount of say</b> will be negative, which means that if the stump outputs "survived", the negative weight will turn that vote into "not survived."

P.S. If the <b>Total Error</b> is 1 or 0, this equation will freak out (it divides by zero or takes the log of zero). A small amount of error is added to prevent this from happening.

---

> <b>Third</b>, we need to learn how to modify the weights so that the next stump will take the errors that the current stump made into account.
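
Before turning to the weight update, the amount-of-say formula from the second step can be checked numerically. The following is a minimal sketch in plain Python that reuses the 10-sample example from the text; the `amount_of_say` helper and the clipping constant `eps` are illustrative assumptions (one simple way to keep the total error away from exactly 0 or 1), not scikit-learn's implementation.

```python
import math

def amount_of_say(total_error, eps=1e-10):
    """alpha = 0.5 * ln((1 - error) / error), with the error clipped away from 0 and 1."""
    error = min(max(total_error, eps), 1 - eps)  # assumed guard against 0 and 1
    return 0.5 * math.log((1 - error) / error)

# Worked example from the text: 10 samples, uniform weights, 2 misclassified.
weights = [1 / 10] * 10
misclassified = [False] * 8 + [True] * 2
total_error = sum(w for w, wrong in zip(weights, misclassified) if wrong)

print(round(total_error, 3))                 # 0.2
print(round(amount_of_say(total_error), 3))  # 0.693  (good stump, positive say)
print(round(amount_of_say(0.5), 3))          # 0.0    (coin flip, no say)
print(round(amount_of_say(0.8), 3))          # -0.693 (bad stump, vote is flipped)
```

The three printed alpha values correspond to the three regions of the curve described above: positive for a small total error, zero at 0.5, and negative as the error approaches 1.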
The pseudocode for calculating the new sample weight is as follows.

$$ \text{New Sample Weight} = \text{Sample Weight} \times e^{\pm\alpha_t} $$

Here the sign in front of $\alpha_t$ (the amount of say) depends on whether the sample was correctly classified or misclassified by the current stump: it is positive for a misclassified sample and negative for a correctly classified one. We want to increase the sample weight of the misclassified samples, hinting the next stump to put more emphasis on those. Inversely, we want to decrease the sample weight of the correctly classified samples, hinting the next stump to put less emphasis on those.

The following equation helps us do this calculation.

$$ D_{t+1}(i) = D_t(i) e^{-\alpha_t y_i h_t(x_i)} $$

Here,

- $D_{t+1}(i)$ = New sample weight.
- $D_t(i)$ = Current sample weight.
- $\alpha_t$ = Amount of say (alpha value); this coefficient is recalculated in each iteration.
- $y_i h_t(x_i)$ = placeholder that equals 1 if the stump classified sample $i$ correctly and -1 if it misclassified it.

After the update, the weights are normalized so that they again sum to 1.

Finally, we put together the combined classifier, which is

$$ AdaBoost(X) = sign\left(\sum_{t=1}^T\alpha_t h_t(X)\right) $$

Here,

- $AdaBoost(X)$ is the classification prediction for $y$ using predictor matrix $X$
- $T$ is the number of "weak learners"
- $\alpha_t$ is the contribution weight for weak learner $t$
- $h_t(X)$ is the prediction of weak learner $t$
- and $y$ is binary **with values -1 and 1**

P.S. Since a stump barely captures the essential characteristics of the dataset, the model is highly biased in the beginning.
However, as the chain of stumps continues, by the end of the process AdaBoost becomes a strong learner and reduces both bias and variance.

<h3>Resources:</h3>
<ul>
    <li><a href="https://www.youtube.com/watch?v=LsK-xG1cLYA">Statquest</a></li>
    <li><a href="https://www.youtube.com/watch?v=-DUxtdeCiB4">Principles of Machine Learning | AdaBoost(Video)</a></li>
</ul>

```{julia}
#| execution: {iopub.execute_input: '2021-06-26T16:40:17.227822Z', iopub.status.busy: '2021-06-26T16:40:17.227396Z', iopub.status.idle: '2021-06-26T16:41:28.311627Z', shell.execute_reply: '2021-06-26T16:41:28.311009Z', shell.execute_reply.started: '2021-06-26T16:40:17.227656Z'}
from sklearn.ensemble import AdaBoostClassifier

n_estimators = [100, 140, 145, 150, 160, 170, 175, 180, 185]
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
learning_r = [0.1, 1, 0.01, 0.5]

parameters = {'n_estimators': n_estimators,
              'learning_rate': learning_r}
## The base estimator is left at its default, which is a depth-1 decision tree (a stump).
grid = GridSearchCV(AdaBoostClassifier(),
                    param_grid=parameters,
                    cv=cv,
                    n_jobs=-1)
grid.fit(X, y)
```

```{julia}
#| execution: {iopub.execute_input: '2021-06-26T16:41:28.313135Z', iopub.status.busy: '2021-06-26T16:41:28.31287Z', iopub.status.idle: '2021-06-26T16:41:28.318909Z', shell.execute_reply: '2021-06-26T16:41:28.318191Z', shell.execute_reply.started: '2021-06-26T16:41:28.313088Z'}
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
```

```{julia}
#| execution: {iopub.execute_input: '2021-06-26T16:41:28.320845Z', iopub.status.busy: '2021-06-26T16:41:28.320267Z', iopub.status.idle: '2021-06-26T16:41:28.35912Z', shell.execute_reply: '2021-06-26T16:41:28.358535Z', shell.execute_reply.started: '2021-06-26T16:41:28.320797Z'}
adaBoost_grid = grid.best_estimator_
adaBoost_grid.score(X, y)
```

## Pros and cons of boosting

---

### Pros

- Achieves higher performance than bagging when hyper-parameters are tuned properly.
- Can be used for classification and regression equally well.
- Easily handles mixed data types.
- Can use "robust" loss functions that make the model resistant to outliers.

---

### Cons

- Difficult and time consuming to properly tune hyper-parameters.
- Cannot be parallelized like bagging (scales poorly to huge amounts of data).
- More risk of overfitting compared to bagging.

<h3>Resources: </h3>
<ul>
    <li><a href="http://mccormickml.com/2013/12/13/adaboost-tutorial/">AdaBoost Tutorial-Chris McCormick</a></li>
    <li><a href="http://rob.schapire.net/papers/explaining-adaboost.pdf">Explaining AdaBoost by Robert Schapire (one of the original authors of AdaBoost)</a></li>
</ul>

## 7i. Gradient Boosting Classifier
<a id="gradient_boosting"></a>

---

```{julia}
#| _cell_guid: d32d6df9-b8e7-4637-bacc-2baec08547b8
#| _uuid: fd788c4f4cde834a1329f325f1f59e3f77c37e42
#| execution: {iopub.execute_input: '2021-06-26T16:41:28.360536Z', iopub.status.busy: '2021-06-26T16:41:28.360265Z', iopub.status.idle: '2021-06-26T16:41:28.521396Z', shell.execute_reply: '2021-06-26T16:41:28.520426Z', shell.execute_reply.started: '2021-06-26T16:41:28.360479Z'}
#| scrolled: true
# Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier

gradient_boost = GradientBoostingClassifier()
gradient_boost.fit(X, y)
y_pred = gradient_boost.predict(X_test)
gradient_accy = round(accuracy_score(y_test, y_pred), 3)
print(gradient_accy)
```

<div class=" alert alert-info">
<h3>Resources: </h3>
<ul>
    <li><a href="https://www.youtube.com/watch?v=sDv4f4s2SB8">Gradient Descent(StatQuest)</a></li>
    <li><a href="https://www.youtube.com/watch?v=3CC4N4z3GJc">Gradient Boost(Regression Main Ideas)(StatQuest)</a></li>
    <li><a href="https://www.youtube.com/watch?v=3CC4N4z3GJc">Gradient Boost(Regression Calculation)(StatQuest)</a></li>
    <li><a href="https://www.youtube.com/watch?v=jxuNLH5dXCs">Gradient Boost(Classification Main Ideas)(StatQuest)</a></li>
    <li><a href="https://www.youtube.com/watch?v=StWY5QWMXCw">Gradient Boost(Classification Calculation)(StatQuest)</a></li>
    <li><a href="https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/">Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python</a></li>
</ul>
</div>

## 7j. XGBClassifier
<a id="XGBClassifier"></a>

---

```{julia}
#| _cell_guid: 5d94cc5b-d8b7-40d3-b264-138539daabfa
#| _uuid: 9d96154d2267ea26a6682a73bd1850026eb1303b
#| execution: {iopub.execute_input: '2021-06-26T16:41:28.523177Z', iopub.status.busy: '2021-06-26T16:41:28.522724Z', iopub.status.idle: '2021-06-26T16:41:28.526955Z', shell.execute_reply: '2021-06-26T16:41:28.525945Z', shell.execute_reply.started: '2021-06-26T16:41:28.522964Z'}
# from xgboost import XGBClassifier
# xgb_classifier = XGBClassifier()
# xgb_classifier.fit(X, y)
# y_pred = xgb_classifier.predict(X_test)
# XGBClassifier_accy = round(accuracy_score(y_test, y_pred), 3)
# print(XGBClassifier_accy)
```

## 7k. Extra Trees Classifier
<a id="extra_tree"></a>

---

```{julia}
#| _cell_guid: 2e567e01-6b5f-4313-84af-cc378c3b709e
#| _uuid: c9b958e2488adf6f79401c677087e3250d63ac9b
#| execution: {iopub.execute_input: '2021-06-26T16:41:28.528841Z', iopub.status.busy: '2021-06-26T16:41:28.528382Z', iopub.status.idle: '2021-06-26T16:41:28.555697Z', shell.execute_reply: '2021-06-26T16:41:28.554889Z', shell.execute_reply.started: '2021-06-26T16:41:28.528664Z'}
from sklearn.ensemble import ExtraTreesClassifier

# Use a separate variable name so the fitted model does not shadow the class.
extra_trees = ExtraTreesClassifier()
extra_trees.fit(X, y)
y_pred = extra_trees.predict(X_test)
extraTree_accy = round(accuracy_score(y_test, y_pred), 3)
print(extraTree_accy)
```

## 7l. Gaussian Process Classifier
<a id="GaussianProcessClassifier"></a>

---

```{julia}
#| _cell_guid: 23bd5744-e04d-49bb-9d70-7c2a518f76dd
#| _uuid: 57fc008eea2ce1c0b595f888a82ddeaee6ce2177
#| execution: {iopub.execute_input: '2021-06-26T16:41:28.557268Z', iopub.status.busy: '2021-06-26T16:41:28.556845Z', iopub.status.idle: '2021-06-26T16:41:28.863352Z', shell.execute_reply: '2021-06-26T16:41:28.862576Z', shell.execute_reply.started: '2021-06-26T16:41:28.557221Z'}
from sklearn.gaussian_process import GaussianProcessClassifier

# Use a separate variable name so the fitted model does not shadow the class.
gaussian_process = GaussianProcessClassifier()
gaussian_process.fit(X, y)
y_pred = gaussian_process.predict(X_test)
gau_pro_accy = round(accuracy_score(y_test, y_pred), 3)
print(gau_pro_accy)
```

## 7m. Voting Classifier
<a id="voting_classifer"></a>

---

```{julia}
#| _cell_guid: ac208dd3-1045-47bb-9512-de5ecb5c81b0
#| _uuid: 821c74bbf404193219eb91fe53755d669f5a14d1
#| execution: {iopub.execute_input: '2021-06-26T16:41:28.865063Z', iopub.status.busy: '2021-06-26T16:41:28.86463Z', iopub.status.idle: '2021-06-26T16:41:30.314425Z', shell.execute_reply: '2021-06-26T16:41:30.313671Z', shell.execute_reply.started: '2021-06-26T16:41:28.865013Z'}
from sklearn.ensemble import VotingClassifier

# Hard voting: each fitted model casts one vote and the majority class wins.
voting_classifier = VotingClassifier(estimators=[
    ('lr_grid', logreg_grid),
    ('svc', svm_grid),
    ('random_forest', rf_grid),
    ('gradient_boosting', gradient_boost),
    ('decision_tree_grid', dectree_grid),
    ('knn_classifier', knn_grid),
    # ('XGB_Classifier', xgb_classifier),
    ('bagging_classifier', bagging_grid),
    ('adaBoost_classifier', adaBoost_grid),
    ('ExtraTrees_Classifier', extra_trees),
    ('gaussian_classifier', gaussian),
    ('gaussian_process_classifier', gaussian_process)
], voting='hard')

#voting_classifier = voting_classifier.fit(train_x,train_y)
voting_classifier = voting_classifier.fit(X, y)
```

```{julia}
#| _cell_guid: 648ac6a6-2437-490a-bf76-1612a71126e8
#| _uuid: 518a02ae91cc91d618e476d1fc643cd3912ee5fb
#| execution: {iopub.execute_input: '2021-06-26T16:41:30.316454Z', iopub.status.busy: '2021-06-26T16:41:30.316008Z', iopub.status.idle: '2021-06-26T16:41:30.42114Z', shell.execute_reply: '2021-06-26T16:41:30.420152Z', shell.execute_reply.started: '2021-06-26T16:41:30.31627Z'}
y_pred = voting_classifier.predict(X_test)
voting_accy = round(accuracy_score(y_test, y_pred), 3)
print(voting_accy)
```

```{julia}
#| _cell_guid: 277534eb-7ec8-4359-a2f4-30f7f76611b8
#| _kg_hide-input: true
#| _uuid: 00a9b98fd4e230db427a63596a2747f05b1654c1
#| execution: {iopub.execute_input: '2021-06-26T16:41:30.422908Z', iopub.status.busy: '2021-06-26T16:41:30.422475Z', iopub.status.idle: '2021-06-26T16:41:30.426856Z', shell.execute_reply: '2021-06-26T16:41:30.425882Z', shell.execute_reply.started: '2021-06-26T16:41:30.422736Z'}
#models = pd.DataFrame({
#    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
#              'Random Forest', 'Naive Bayes',
#              'Decision Tree', 'Gradient Boosting Classifier', 'Voting Classifier', 'XGB Classifier','ExtraTrees Classifier','Bagging Classifier'],
#    'Score': [svc_accy, knn_accy, logreg_accy,
#              random_accy, gaussian_accy, dectree_accy,
#              gradient_accy, voting_accy, XGBClassifier_accy, extraTree_accy, bagging_accy]})
#models.sort_values(by='Score', ascending=False)
```

# Part 8: Submit test predictions
<a id="submit_predictions"></a>

---

```{julia}
#| _uuid: eb0054822f296ba86aa6005b2a5e35fbc1aec88b
#| execution: {iopub.execute_input: '2021-06-26T16:41:30.429099Z', iopub.status.busy: '2021-06-26T16:41:30.42862Z', iopub.status.idle: '2021-06-26T16:41:30.646363Z', shell.execute_reply: '2021-06-26T16:41:30.645616Z', shell.execute_reply.started: '2021-06-26T16:41:30.428903Z'}
all_models = [logreg_grid, knn_grid, knn_ran_grid, svm_grid, dectree_grid,
              rf_grid, bagging_grid, adaBoost_grid, voting_classifier]

# Map each fitted model to its accuracy on the held-out split.
c = {}
for i in all_models:
    a = i.predict(X_test)
    b = accuracy_score(y_test, a)
    c[i] = b
```

```{julia}
#| _cell_guid: 51368e53-52e4-41cf-9cc9-af6164c9c6f5
#| _uuid: b947f168f6655c1c6eadaf53f3485d57c0cd74c7
#| execution: {iopub.execute_input: '2021-06-26T16:41:30.648318Z', iopub.status.busy: '2021-06-26T16:41:30.647987Z', iopub.status.idle: '2021-06-26T16:41:32.045557Z', shell.execute_reply: '2021-06-26T16:41:32.044733Z', shell.execute_reply.started: '2021-06-26T16:41:30.648259Z'}
# Predict on the competition test set with the model that scored highest above.
test_prediction = (max(c, key=c.get)).predict(test)
submission = pd.DataFrame({
    "PassengerId": passengerid,
    "Survived": test_prediction
})

submission.PassengerId = submission.PassengerId.astype(int)
submission.Survived = submission.Survived.astype(int)

submission.to_csv("titanic1_submission.csv", index=False)
```

<div class="alert alert-info">
    <h1>Resources</h1>
    <ul>
        <li><b>Statistics</b></li>
        <ul>
            <li><a href="https://statistics.laerd.com/statistical-guides/measures-of-spread-standard-deviation.php">Types of Standard Deviation</a></li>
            <li><a href="https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-is-a-t-test-and-why-is-it-like-telling-a-kid-to-clean-up-that-mess-in-the-kitchen">What Is a t-test? And Why Is It Like Telling a Kid to Clean Up that Mess in the Kitchen?</a></li>
            <li><a href="https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics">What Are T Values and P Values in Statistics?</a></li>
            <li><a href="https://www.youtube.com/watch?v=E4KCfcVwzyw">What is p-value? How we decide on our confidence level.</a></li>
        </ul>
        <li><b>Writing pythonic code</b></li>
        <ul>
            <li><a href="https://www.kaggle.com/rtatman/six-steps-to-more-professional-data-science-code">Six steps to more professional data science code</a></li>
            <li><a href="https://www.kaggle.com/jpmiller/creating-a-good-analytics-report">Creating a Good Analytics Report</a></li>
            <li><a href="https://en.wikipedia.org/wiki/Code_smell">Code Smell</a></li>
            <li><a href="https://www.python.org/dev/peps/pep-0008/">Python style guides</a></li>
            <li><a href="https://gist.github.com/sloria/7001839">The Best of the Best Practices(BOBP) Guide for Python</a></li>
            <li><a href="https://www.python.org/dev/peps/pep-0020/">PEP 20 -- The Zen of Python</a></li>
            <li><a href="https://docs.python-guide.org/">The Hitchhiker's Guide to Python</a></li>
            <li><a href="https://realpython.com/tutorials/best-practices/">Python Best Practice Patterns</a></li>
            <li><a href="http://www.nilunder.com/blog/2013/08/03/pythonic-sensibilities/">Pythonic Sensibilities</a></li>
        </ul>
        <li><b>Why Scikit-Learn?</b></li>
        <ul>
            <li><a href="https://www.oreilly.com/content/intro-to-scikit-learn/">Introduction to Scikit-Learn</a></li>
            <li><a href="https://www.oreilly.com/content/six-reasons-why-i-recommend-scikit-learn/">Six reasons why I recommend scikit-learn</a></li>
            <li><a href="https://hub.packtpub.com/learn-scikit-learn/">Why you should learn Scikit-learn</a></li>
            <li><a href="https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines">A Deep Dive Into Sklearn Pipelines</a></li>
            <li><a href="https://www.kaggle.com/sermakarevich/sklearn-pipelines-tutorial">Sklearn pipelines tutorial</a></li>
            <li><a href="https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html">Managing Machine Learning workflows with Sklearn pipelines</a></li>
            <li><a href="https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976">A simple example of pipeline in Machine Learning using SKlearn</a></li>
        </ul>
    </ul>
    <h1>Credits</h1>
    <ul>
        <li>To Brandon Foltz for his <a href="https://www.youtube.com/channel/UCFrjdcImgcQVyFbK04MBEhA">youtube</a> channel and for being an amazing teacher.</li>
        <li>To GA where I started my data science journey.</li>
        <li>To Kaggle community for inspiring me over and over again with all the resources I need.</li>
        <li>To Udemy Course "Deployment of Machine Learning". I have used and modified some of the code from this course to help make the learning process intuitive.</li>
    </ul>
</div>

<div class="alert alert-info">
<h4>If you would like to discuss any other projects or just have a chat about data science topics, I'll be more than happy to connect with you on:</h4>
    <ul>
        <li><a href="https://www.linkedin.com/in/masumrumi/"><b>LinkedIn</b></a></li>
        <li><a href="https://github.com/masumrumi"><b>Github</b></a></li>
        <li><a href="https://masumrumi.github.io/cv/"><b>masumrumi.github.io/cv/</b></a></li>
        <li><a href="https://www.youtube.com/channel/UC1mPjGyLcZmsMgZ8SJgrfdw"><b>Youtube</b></a></li>
    </ul>
<p>This kernel will always be a work in progress. I will incorporate new concepts of data science as I comprehend them with each update. If you have any ideas/suggestions about this notebook, please let me know. Any feedback about further improvements would be genuinely appreciated.</p>
<h1>If you have come this far, Congratulations!!</h1>
<h1>If this notebook helped you in any way or you liked it, please upvote and/or leave a comment!! :)</h1>
</div>